Preface



International Federation of Classification Societies 

The International Federation of Classification Societies (IFCS) is an agency for 
the dissemination of technical and scientific information concerning classification 
and multivariate data analysis in the broad sense and in as wide a range of 
applications as possible; founded in 1985 in Cambridge (UK) by the following 
Scientific Societies and Groups: 

- British Classification Society - BCS 

- Classification Society of North America - CSNA 

- Gesellschaft für Klassifikation - GfKl

- Japanese Classification Society - JCS 

- Classification Group of Italian Statistical Society - CGSIS 

- Société Francophone de Classification - SFC

The IFCS now also includes the following Societies:

- Dutch-Belgian Classification Society - VOC 

- Polish Classification Section - SKAD 

- Portuguese Classification Association - CLAD 

- Group at Large 

- Korean Classification Society - KCS 

IFCS-98, the Sixth Conference of the International Federation of Classification 
Societies, was held in Rome, from July 21 to 24, 1998. 

Five preceding conferences were held in Aachen (Germany), Charlottesville 
(USA), Edinburgh (UK), Paris (France), Kobe (Japan). 



Conference Organization 

The scientific program of the Conference included 67 invited papers and 129
contributed papers, which discussed recent developments in the following topics:

• Classification Theory 

• Multivariate Data Analysis 

• Multiway Data Analysis 

• Proximity Structure Analysis 

• Software Developments for Classification 

• Applied Classification and Data Analysis in Social, Economic, Medical and 
other Sciences 
The novelty of IFCS-98 is the publication of the papers before the Conference,
since new electronic media allow a quick diffusion of the most recent
developments of the topics discussed.

This volume, "Advances in Data Science and Classification", contains 39
invited and 52 contributed refereed papers.

They are presented in six chapters as follows: 

• Methodologies in Classification 

- Clustering and Classification 

- Comparison and Consensus of Classifications 

- Fuzzy Clustering and Fuzzy Methods 

- Optimization in Classification and Constrained Classification 

- Probabilistic Modelling Methods in Classification and Pattern 
Recognition 

• Other Approaches for Classification: Discriminant, Neural Network, 
Regression Tree 

- Discrimination and Classification 

- Neural Network and Classification 

- Regression Tree 

• Factorial Methods 

- Factorial Methods for Multivariate Data 

• Symbolic and Textual Data Analysis 

- Symbolic Data Analysis 

- Textual Data Analysis 

• Multivariate and Multidimensional Data Analysis 

- Proximity Analysis and Multidimensional Scaling 

- Multivariate Data Analysis 

- Non-Linear Data Analysis 

- Multiway Data Analysis 

• Case Studies 

- Applied Classification and Data Analysis 

The volume provides new developments in classification and data analysis and 
presents new topics which are of central interest to modern statistics. For most
topics considered, this book provides a systematic state-of-the-art account written 
by some of the top researchers in the world. We hope this volume will serve as a 
helpful introduction to the area of classification and data analysis and contribute 
significantly to the quick transfer of new advances in data science and 
classification. 
K-midranges clustering

J. Douglas Carroll 
Rutgers University 
Faculty of Management 
Management Education Center # 125 
81 New Street
Newark NJ 07102 

Anil Chaturvedi 
AT&T Laboratories 
Room 5c-134
Abstract: We present a new clustering procedure called K-midranges clustering.
K-midranges is analogous to the traditional K-means procedure for clustering
interval scale data. The K-midranges procedure explicitly optimizes a loss
function based on the L∞ norm (defined as the limit of an Lp norm as p approaches
infinity).

Keywords: Continuous data, Cluster analysis, Groups, Midrange, K-means



1. Introduction 

In this paper, we present a clustering procedure called K-midranges, which is
analogous to K-means or K-medians clustering, but explicitly optimizes an
L∞-loss function. The K-midranges procedure is as fast as K-means clustering, can
handle large data sets, and is computationally tractable. We also find it yields
interpretable clusters for a marketing data set.

In the following sections, we describe the general bilinear clustering model, and
show that when model parameters are estimated using an L∞-norm-based loss
function with partitioning constraints on the clusters, this results in the
K-midranges clustering procedure.



2. The General Bilinear Clustering Model 

Assume that we have data on N continuous variables for M objects (or subjects). 
Let K be the number of clusters being sought. Then, the general bilinear model 
can be written as: 
$$\mathbf{C}_{M \times N} = \mathbf{S}_{M \times K}\,\mathbf{W}_{K \times N} + \text{error} \qquad (1)$$

where



- C is a subjects × variables data matrix,

- S is a binary indicator matrix for membership of the M subjects in K (possibly
overlapping) clusters. In the present case, however, we constrain the model to
yield mutually exclusive, non-overlapping clusters (so that each row of S has
exactly one element equal to one and the remaining elements equal to zero),
and

- W is a matrix of "generalized centroids", defined in this case as mid-ranges of
certain observations.



It should be noted that only C, the data matrix, is known in (1), while both S and 
W are unknown and need to be estimated. This general bilinear model has been 
used in the past for determining overlapping clusters for interval scale data (where
C is continuous) by Mirkin (1990) and Chaturvedi, Carroll, Green, and
Rotondo (1997), and is a special case of the CANDCLUS approach developed by 
Carroll and Chaturvedi (1995). In this paper, we assume that the data matrix C 
consists of interval scale valued elements, while S will be constrained, as stated 
earlier, to define a partition. 



3. Parameter Estimation via an L∞ norm



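As in Chaturvedi, Carroll, Green, and Rotondo (1997), we define the parameter
estimation problem via minimizing an Lp-norm based loss function

$$L_p = \left( \sum_{m=1}^{M} \sum_{n=1}^{N} \left| c_{mn} - \hat{c}_{mn} \right|^{p} \right)^{1/p}$$

where $\hat{c}_{mn}$ is the (m,n)th element of $\hat{\mathbf{C}} = \mathbf{S}\mathbf{W}$, to estimate S and W, for
positive values of p, including the limiting case p → ∞.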



In this paper, we concentrate on estimating the model in equation (1) using the L∞
loss function, while S is constrained to be a partitioning matrix. The L∞ norm,
defined as the limit of the Lp norm as p approaches infinity, can be shown to be:

$$L_\infty = \max_{m,n} \left| c_{mn} - \hat{c}_{mn} \right|, \quad m = 1, \ldots, M, \; n = 1, \ldots, N.$$

It is often called the "max metric" or the "dominance metric".
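The limiting behaviour can be checked numerically. The following is a minimal sketch, assuming the Lp loss is the p-th root of the summed p-th powers of the absolute residuals; the function names and the two small matrices are illustrative and do not come from the paper.

```python
def lp_loss(C, C_hat, p):
    """L_p loss: p-th root of the summed p-th powers of absolute residuals."""
    total = sum(abs(c - ch) ** p
                for row, row_hat in zip(C, C_hat)
                for c, ch in zip(row, row_hat))
    return total ** (1.0 / p)


def linf_loss(C, C_hat):
    """L_inf loss: the largest absolute residual (the "max metric")."""
    return max(abs(c - ch)
               for row, row_hat in zip(C, C_hat)
               for c, ch in zip(row, row_hat))


# Illustrative data and fitted values (hypothetical, for demonstration only).
C = [[1, 5, 0, 3], [0, 6, 1, 2]]
C_hat = [[2, 6, 1, 3], [2, 6, 1, 3]]

# As p grows, the L_p loss decreases towards the L_inf loss.
for p in (1, 2, 10, 100):
    print(p, lp_loss(C, C_hat, p))
print("inf", linf_loss(C, C_hat))
```

The largest residual dominates the sum once p is large, which is why the loss is driven entirely by the worst-fitting element.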
4. Estimation Procedure

The matrices S and W are estimated in an iterative fashion (estimating S given
estimates of W, then revising the estimates of W given the new estimates of S)
until the L∞ loss function does not improve. The procedures for estimating S and
W are given below:

To estimate S, the cluster membership, given estimates of W, consider the
following illustrative case by first defining

$$\mathbf{C} = \begin{bmatrix} 1 & 5 & 0 & 3 \\ 2 & 6 & 1 & 3 \\ 3 & 6 & 0 & 3 \\ 0 & 6 & 1 & 2 \\ 1 & 5 & 0 & 1 \\ 2 & 4 & 1 & 2 \end{bmatrix}, \quad
\mathbf{S} = \begin{bmatrix} s_{11} & s_{12} \\ s_{21} & s_{22} \\ s_{31} & s_{32} \\ s_{41} & s_{42} \\ s_{51} & s_{52} \\ s_{61} & s_{62} \end{bmatrix}, \quad
\mathbf{W} = \begin{bmatrix} 2 & 6 & 1 & 3 \\ 1 & 5 & 0 & 1 \end{bmatrix}.$$

The matrices C and W are assumed known, while a current (conditional) estimate
of S is sought. The (i,j)th entry of matrix W corresponds to the continuous value
defining the centroid for the ith cluster and the jth continuous variable. In the
general case, we wish to find the L∞-norm based estimates of S, where
C = SW + error and elements of S are either 0 or 1. If we let

$$\begin{aligned}
f_1 &= \max(|1 - 2s_{11} - 1s_{12}|,\ |5 - 6s_{11} - 5s_{12}|,\ |0 - 1s_{11} - 0s_{12}|,\ |3 - 3s_{11} - 1s_{12}|), \\
f_2 &= \max(|2 - 2s_{21} - 1s_{22}|,\ |6 - 6s_{21} - 5s_{22}|,\ |1 - 1s_{21} - 0s_{22}|,\ |3 - 3s_{21} - 1s_{22}|), \\
f_3 &= \max(|3 - 2s_{31} - 1s_{32}|,\ |6 - 6s_{31} - 5s_{32}|,\ |0 - 1s_{31} - 0s_{32}|,\ |3 - 3s_{31} - 1s_{32}|), \\
f_4 &= \max(|0 - 2s_{41} - 1s_{42}|,\ |6 - 6s_{41} - 5s_{42}|,\ |1 - 1s_{41} - 0s_{42}|,\ |2 - 3s_{41} - 1s_{42}|), \\
f_5 &= \max(|1 - 2s_{51} - 1s_{52}|,\ |5 - 6s_{51} - 5s_{52}|,\ |0 - 1s_{51} - 0s_{52}|,\ |1 - 3s_{51} - 1s_{52}|), \text{ and} \\
f_6 &= \max(|2 - 2s_{61} - 1s_{62}|,\ |4 - 6s_{61} - 5s_{62}|,\ |1 - 1s_{61} - 0s_{62}|,\ |2 - 3s_{61} - 1s_{62}|),
\end{aligned}$$
then the total mismatch (L∞ loss function) is given by $F = \max(f_1, f_2, f_3, f_4, f_5, f_6)$.
Note that $f_1$ is a function only of $s_{11}$ and $s_{12}$; $f_2$ is a function only of $s_{21}$ and $s_{22}$;
$f_3$ is a function only of $s_{31}$ and $s_{32}$; $f_4$ is a function only of $s_{41}$ and $s_{42}$; and so on. Thus, F is
separable with respect to parameters for each row of S. To minimize F, one can
separately minimize $f_1$ w.r.t. parameters for row 1, $f_2$ w.r.t. parameters for row 2,
etc. Certainly, if $f_1$ through $f_M$ are minimized, then $F = \max(f_m)$, (m = 1, ..., M),
will be minimized.¹

To minimize, say, $f_1$ w.r.t. the parameters for row 1, $[s_{11}\ s_{12}]$, with partitioning
constraints, one evaluates $f_1$ explicitly at the two permissible values of row 1,
[1 0] and [0 1], given the constraint to a partitioning solution. For the pattern [1 0],
$f_1$ is obtained by computing the maximum of absolute differences between the
data value and the predicted value, computed for each element in the row. Thus,
for the pattern [1 0], the absolute differences for the four elements are 1, 1, 1, and
0, resulting in a maximum absolute difference of 1. For the pattern [0 1], the
element-wise differences are 0, 0, 0, and 2. This results in a maximum difference of
2 for the pattern [0 1]. The winning pattern is the pattern that results in the
smallest maximum difference. Thus, in this case, the winning pattern is [1 0], and
$\hat{s}_{11} = 1$ and $\hat{s}_{12} = 0$ are the optimal estimates. The other rows of $\hat{\mathbf{S}}$, the current
estimate of S, can be determined using a similar procedure.
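The row-by-row assignment can be sketched in a few lines of Python. This is a minimal illustration using the 6 × 4 data matrix and the two centroid rows of the example; the helper names are hypothetical, and ties are broken in favour of the lowest-numbered cluster, which is an assumption of this sketch.

```python
# Data matrix C (6 subjects x 4 variables) and current centroids W (2 x 4)
# from the illustrative case in the text.
C = [[1, 5, 0, 3],
     [2, 6, 1, 3],
     [3, 6, 0, 3],
     [0, 6, 1, 2],
     [1, 5, 0, 1],
     [2, 4, 1, 2]]
W = [[2, 6, 1, 3],
     [1, 5, 0, 1]]


def max_abs_deviation(row, centroid):
    """Largest absolute difference between a data row and a centroid."""
    return max(abs(c - w) for c, w in zip(row, centroid))


def assign_rows(C, W):
    """Return the indicator matrix S and the per-row deviations to each centroid."""
    S, deviations = [], []
    for row in C:
        dev = [max_abs_deviation(row, centroid) for centroid in W]
        winner = dev.index(min(dev))  # ties go to the lowest-numbered cluster
        S.append([1 if k == winner else 0 for k in range(len(W))])
        deviations.append(dev)
    return S, deviations


S, deviations = assign_rows(C, W)
print(deviations[0])  # deviations of row 1 for the patterns [1 0] and [0 1]
print(S)
```

For row 1 the two deviations are 1 and 2, so the pattern [1 0] wins, as in the text. Applying the same rule to the remaining rows places subjects 1 to 3 in the first cluster and subjects 4 to 6 in the second.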

We note that, since S induces a partition on the rows of C, the rows of W
correspond to "generalized centroids" of the associated clusters. Given that we are
minimizing the L∞ loss function, these "generalized centroids" will be vectors of
midranges, since it can easily be seen that the measure of central tendency
minimizing the L∞-norm based loss function is the midrange (mean of the two
most extreme values). We can use this to advantage in estimating W, given C and
conditional estimates of S. To illustrate, using the previous example, to estimate
W (the cluster "generalized centroids"), given estimates of S, we first define:





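$$\mathbf{C} = \begin{bmatrix} 1 & 5 & 0 & 3 \\ 2 & 6 & 1 & 3 \\ 3 & 6 & 0 & 3 \\ 0 & 6 & 1 & 2 \\ 1 & 5 & 0 & 1 \\ 2 & 4 & 1 & 2 \end{bmatrix}, \quad
\mathbf{S} = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}, \quad \text{and} \quad
\mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \end{bmatrix}.$$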







¹ While, given the definition of F as $\max(f_m)$, (m = 1, ..., M), we should minimize F by simply
minimizing the largest of the $f_m$'s, we resolve the implied indeterminacy by choosing the solution
that separately minimizes each of the $f_m$'s.







The matrices C and S are assumed known, while a current (conditional) estimate
of W is sought. In order to determine the L∞ estimate of W, where
C = SW + error and W is continuous, we can show that the best estimate of $w_{11}$
that optimizes the L∞ loss function is the mid-range of (1, 2, 3) = 2, and hence,
$\hat{w}_{11} = 2$. Similarly, $\hat{w}_{12}$ will be the mid-range of (5, 6, 6) = 5.5, ..., and $\hat{w}_{24}$ will be the
mid-range of (2, 1, 2) = 1.5. Thus, $\hat{\mathbf{W}}$, the current estimate of W, is a matrix of
midranges of the variables for each cluster, and hence, we call this procedure a
K-midranges clustering procedure. We break ties arbitrarily in the current version of
the software. We repeat the estimation of S given $\hat{\mathbf{W}}$, and W given $\hat{\mathbf{S}}$, until the
L∞ loss function does not improve. It should be noted that upon convergence, this
procedure can yield locally optimal solutions (as can K-means, K-medians,
K-modes, and other related methods). The best strategy when using K-midranges
with real data is to use multiple random starting seeds, and choose the solution
with the lowest L∞-norm measure.
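The centroid update can be sketched as follows, using the data matrix C and the partition S of the illustrative example. The function names are hypothetical, and the sketch assumes that every cluster has at least one member.

```python
C = [[1, 5, 0, 3],
     [2, 6, 1, 3],
     [3, 6, 0, 3],
     [0, 6, 1, 2],
     [1, 5, 0, 1],
     [2, 4, 1, 2]]
S = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]


def midrange(values):
    """Mean of the two most extreme values."""
    return (min(values) + max(values)) / 2


def midrange_centroids(C, S):
    """One row of midranges per cluster (assumes no cluster is empty)."""
    n_clusters, n_vars = len(S[0]), len(C[0])
    W = []
    for k in range(n_clusters):
        members = [row for row, s in zip(C, S) if s[k] == 1]
        W.append([midrange([row[n] for row in members]) for n in range(n_vars)])
    return W


def linf_loss(C, S, W):
    """Largest absolute difference between C and the fitted values SW."""
    loss = 0
    for row, s in zip(C, S):
        centroid = W[s.index(1)]
        loss = max(loss, max(abs(c - w) for c, w in zip(row, centroid)))
    return loss


W = midrange_centroids(C, S)
print(W)
print(linf_loss(C, S, W))
```

Within each cluster and variable, the midrange halves the range of the member values, so no other centroid value can give a smaller maximum deviation for that cluster and variable.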



Actually, in this case of non-overlapping clusters (i.e., a partition), the
K-midranges procedure can be described in much simpler terms, closely analogous
to the standard description of K-means, K-medians, or K-modes. Namely, we
alternate between defining cluster "generalized centroids" (in this case midranges)
for an existing partition, and redefining the partition into K non-overlapping
clusters by assigning each object (m) to that cluster for which the maximum
deviation over n between $c_{mn}$ and the midrange for that cluster is minimized;
i.e., we assign each object to the cluster whose "generalized centroid" it is closest
to in the sense of minimizing the L∞ metric. This is equivalent to the K-means,
K-medians, or K-modes algorithms, but with the L2, L1, or L0 norms, respectively,
replaced by the L∞ norm, and means, medians, or modes by midranges.
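The alternating scheme, combined with several random starts, might look like the sketch below. It is not the authors' software: the function name, the default number of starts, the choice of K data rows as starting centroids, and the rule of keeping an old centroid when a cluster becomes empty are all assumptions made for illustration.

```python
import random


def k_midranges(C, K, n_starts=20, max_iter=100, seed=0):
    """Return (loss, labels, W) for the best of several random starts."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        # Assumed start: K distinct data rows serve as initial centroids.
        W = [list(row) for row in rng.sample(C, K)]
        prev_loss = float("inf")
        for _ in range(max_iter):
            # Assignment step: nearest centroid in the max-metric sense.
            labels = []
            for row in C:
                dev = [max(abs(c - w) for c, w in zip(row, centroid))
                       for centroid in W]
                labels.append(dev.index(min(dev)))
            # Update step: midranges of each non-empty cluster.
            for k in range(K):
                members = [row for row, lab in zip(C, labels) if lab == k]
                if members:
                    W[k] = [(min(col) + max(col)) / 2 for col in zip(*members)]
            loss = max(abs(c - w)
                       for row, lab in zip(C, labels)
                       for c, w in zip(row, W[lab]))
            if loss >= prev_loss:  # stop when the L_inf loss no longer improves
                break
            prev_loss = loss
        if best is None or loss < best[0]:
            best = (loss, labels, [list(w) for w in W])
    return best


C = [[1, 5, 0, 3], [2, 6, 1, 3], [3, 6, 0, 3],
     [0, 6, 1, 2], [1, 5, 0, 1], [2, 4, 1, 2]]
loss, labels, W = k_midranges(C, K=2)
print(loss, labels, W)
```

Because each start can end in a different local optimum, the function keeps only the solution with the lowest L∞ loss across starts.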



As with K-means, K-medians, and K-modes, the (non-overlapping) K-midranges
approach to clustering can be generalized to overlapping K-midranges - but we
shall not consider that further in this paper (except to say that this relies on the
same "one cluster at a time" type of estimation procedure discussed by
Chaturvedi, Carroll, Green, and Rotondo, 1997, in their paper describing
overlapping K-means and K-medians approaches, and discussing other
"overlapping K-centroids" approaches based on optimizing different
Lp-norm-based loss functions).



5. Application of K-midranges to Market Segmentation 

The K-midranges procedure was applied to segment the mail order insurance
market based on data obtained from a conjoint analysis study. Six hundred
potential buyers participated in a study designed to evaluate new financial service
concepts for the mail order insurance market in the southwest U.S. Nine attributes
- price of the basic service, credit management, credit recovery, assisted savings
and investment, budgeting and control of discretionary spending, basic insurance
services, general advice/information, bill payment, and service provision method
- were used in a conjoint design. Data were collected from 600 potential
purchasers.



Conjoint analysis was applied to derive importances of the nine financial services 
attributes for each of the 600 subjects. The derived attribute importances were 
then used in a subsequent market segmentation analysis. The basic data consisted 
of a 600 × 9 (subjects by variables) matrix of derived importances. The derived
importances were normalized across the nine attributes for each subject such that 
the row sum for each subject was 1.0. 



K-means clustering 

K-means cluster analysis was used to find clusters in the marketplace. Different 
initial seeds were used for the K-means algorithm. Solutions from Ward's 
minimum-variance algorithm were also used as starting seeds for the K-means 
algorithm. A 5-cluster K-means solution, obtained from random starting seeds, 
was chosen on the grounds of highest VAF (Variance-Accounted-For) and
interpretability. The details of the 5-cluster K-means solution are given in Table 1. 



The 5-cluster, traditional K-means solution yielded a VAF value of 50.6%. The 
five segments were interpreted as: 



1. Price only: The mean importance of price was the highest for this segment 
(0.60). Price is the only major benefit that is sought by this segment. 

2. Credit management & control: This segment has the highest mean 
importances for credit management, credit recovery, control of spending, and 
provision method. 
Table 1: 5-cluster K-means solution: Entries are cluster means for variables

Attributes          Price   Credit Mgt.   General    Savings   Price and
                            and Control   Services             Convenience
Basic Price         0.60    0.09          0.09       0.07      0.28
Credit Mgt          0.07    0.25          0.14       0.15      0.15
Credit Recovery     0.03    0.12          0.03       0.02      0.03
S&I                 0.02    0.06          0.07       0.31      0.06
Control Spending    0.04    0.13          0.10       0.08      0.08
Insurance           0.02    0.05          0.11       0.06      0.06
General Advice      0.03    0.06          0.12       0.11      0.06
Bill Payment        0.07    0.10          0.18       0.09      0.10
Provision Method    0.12    0.13          0.18       0.12      0.18
Segment Size        32      181           187        62        138



Table 2: Goodness of Fit for the K-midranges procedure

# Clusters    L∞ Measure+    Normalized L∞ Measure*
 1            0.48           0.50
 2            0.47           0.49
 3            0.35           0.37
 4            0.35           0.36
 5            0.34           0.36
 6            0.31           0.33
 7            0.29           0.30
 8            0.29           0.30
 9            0.27           0.28
10            0.24           0.25
11            0.23           0.24
12            0.20           0.21
13            0.18           0.19
14            0.18           0.19

+ The L∞ measure reported here is the maximum of the absolute differences between each element
of the data matrix and the corresponding value predicted by the K-midranges procedure.

* The normalized L∞ measure is obtained by dividing the L∞ measure by the range of the input
data.

3. General services: The differentially most important attributes characterizing 
this segment are insurance, general advice, bill payment, and provision 
method. 
4. Savings & investment oriented: This segment has the highest importance for
the savings and investment (S&I) attribute.

5. Price and convenience: This segment places a high importance on price and 
provision method. 



K-midranges clustering

We also applied the K-midranges procedure to the same data set. Table 2 presents
the goodness-of-fit measures (the L∞ and the normalized L∞ norms) for one
through fourteen clusters. Table 2 suggests that we extract 3 clusters, since the
normalized L∞ measure has the steepest fall (about 0.12, from 0.49 to 0.37) in
going from 2 to 3 clusters. The 3-cluster solution is reported in Table 3.
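The "steepest fall" criterion is simple arithmetic on the normalized L∞ column of Table 2. The sketch below reproduces it; the variable names are illustrative, and the values are the ones reported in the table for one through fourteen clusters.

```python
# Normalized L_inf measure for 1, 2, ..., 14 clusters (Table 2).
normalized_linf = [0.50, 0.49, 0.37, 0.36, 0.36, 0.33, 0.30,
                   0.30, 0.28, 0.25, 0.24, 0.21, 0.19, 0.19]

# Fall in the measure when one more cluster is added.
drops = [round(a - b, 2) for a, b in zip(normalized_linf, normalized_linf[1:])]

# Position of the largest fall, converted to "from k to k + 1 clusters".
best = max(range(len(drops)), key=drops.__getitem__)
k_from, k_to = best + 1, best + 2

print(drops)
print(k_from, k_to, drops[best])
```

The largest fall is the one from 2 to 3 clusters. All other single-step falls are 0.03 or less, which is why the criterion singles out the 3-cluster solution even though that solution turns out to be hard to use.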

The sizes of the three clusters vary widely (the proportion of observations
ranges from 80% to 2%), and the largest cluster is not quite interpretable. Cluster
1 corresponds to the "Price only" cluster, where the midrange for the price
attribute has a very high value of 0.60, and the other attributes have a very low
value. This cluster contains roughly 18% of the observations. Cluster 2, the
largest cluster, has relatively medium midranges on most attributes, except for
price and advice/information. Cluster 3 corresponds to financial advice seekers
(roughly 2% of the sample), and has a very high midrange value of 0.61 for the
advice/information attribute. Thus, the 3-cluster solution is not very useful from a
marketing perspective. Apart from the "Price only" cluster, and the very small
"Advice seekers" cluster, the rest of the observations were grouped into one large
cluster.

We decided to pick the 13-cluster solution, for reasons of interpretability. 

The thirteen-cluster solution is presented in Table 4. Clusters 1 through 9 have
small cluster sizes, and correspond to observations that have a large value on just 
one attribute, and very small values on the other attributes. Thus, Clusters 1 
through 9 can be called the "Advice extremes", "Credit recovery only", "Extreme 
Price only", "Insurance only", "Provision method only", "Advice only", "Savings 
only", "Control only", and "Credit Management only" clusters. 

Cluster 10 corresponds to the "Price only" cluster. It is similar to cluster 3, but the 
midranges are not that extreme for this cluster. Cluster 11 can be termed as the 
"Price and Convenience" cluster, since the largest two midranges correspond to 
the price and bill provision attributes. It should be noted that a cluster with a very 
Table 3: 3-cluster K-midrange solution

Segment          Size   Pr     CM     CR     SI     Ctrl   Ins    Adv    BP     PM
Price only       105    0.60   0.27   0.12   0.26   0.22   --     0.08   0.14   0.20
Majority         479    --     0.31   0.34   0.35   0.24   --     0.14   0.24   0.35
Advice seekers   --     --     0.10   0.06   0.18   0.10   --     0.61   --     0.12

(-- marks an entry that is not legible.)



Table 4: 13-cluster K-midrange solution

Segment   Size   Pr     CM     CR     SI     Ctrl   Ins    Adv    BP     PM
 1        2      --     0.01   --     --     --     --     --     --     --
 2        8      0.08   0.09   --     --     --     --     --     --     --
 3        10     --     --     0.02   0.01   --     --     --     --     --
 4        11     0.14   0.13   0.03   0.07   0.08   0.44   0.05   --     0.18
 5        11     --     --     0.05   0.10   --     --     --     --     0.55
 6        12     0.07   --     --     0.18   0.10   --     --     0.08   0.12
 7        --     --     --     0.00   0.53   0.08   --     --     --     0.10
 8        --     --     --     --     0.10   --     --     --     0.11   --
 9        --     0.18   0.45   --     --     --     --     --     --     --
10        67     --     --     0.12   0.09   0.08   0.10   0.08   0.11   0.14
11        --     --     0.17   0.08   0.17   0.17   0.12   0.12   0.11   0.23
12        120    --     --     0.11   --     --     --     0.13   0.29   --
13        159    --     --     --     --     --     --     --     0.17   --

(-- marks an entry that is not legible.)



Notation for Tables 3 and 4: 

Pr - Basic Price

CM - Credit Management

CR - Credit Recovery

SI - Savings and Investment

Ctrl - Control over Discretionary Spending

Ins - Insurance

Adv - Advice/Information

BP - Bill Payment

PM - Provision Method

Entries are segment centroids for variables 



similar interpretation had been extracted using the K-means procedure (see Table
1). In fact, 48 observations from the "Price and Convenience" cluster from the
K-means solution lie in cluster 11. Cluster 12 corresponds to the "Convenience"
cluster, since the largest midrange values occur for the Bill payment and Provision
method attributes, both of which are aimed at enhancing consumer convenience.
Cluster 13 can be termed the general services cluster, wherein most attributes are
equally important to the consumers in that cluster.

Thus, of the thirteen clusters extracted, we see that the nine smallest clusters are 
based on large values on just one of the nine variables, while the four largest 
clusters are based on high attribute importance weights for multiple attributes. 
From a marketing standpoint, the four largest clusters would be the most attractive 
and possibly profitable not only because of their larger size, but also because they 
would enable the management of the concerned financial services firm to market a 
mixture of financial services (e.g., mixture of bill payment service and telephone 
banking service for cluster 11, since cluster 11 has the highest midranges for these 
two variables, thereby providing the benefit of convenience to the subjects in that 
cluster) to consumers in these clusters, thereby potentially increasing their 
revenues, and hence profits, from these clusters. 



It would appear that the midranges may be particularly good for isolating outliers; 
the first nine clusters in this case could be viewed as outlying clusters, each based 
on extreme values on a single attribute (or variable). Whether the outlying clusters 
are substantively interesting or not is debatable. Each of these nine clusters appears
to be associated uniquely with one of the nine variables, and so could be viewed 
as analogous to "specific factors" in factor analysis. The four larger clusters, in 
this view, can be interpreted as representing "common structure" remaining after 
these outliers are removed from the data. In fact - without pressing the analogy 
too far - this "common structure" is strongly reminiscent of a "simple structure" 
configuration in factor analysis! 



In terms of cluster membership, the degree of correspondence between the
13-cluster K-midranges solution and the 5-cluster K-means solution, computed using
the modified Rand index measure of Hubert and Arabie (1985), is 0.154. We
also compared the 5-cluster K-means solution with the four largest clusters
derived from the 13-cluster K-midranges clustering procedure. The modified
Rand index was only 0.170, indicating a very low correspondence between the
5-cluster K-means solution and the four largest clusters derived from the
13-cluster K-midranges solution. Thus, the K-midranges procedure and the K-means
procedures offer very different solutions in terms of segment membership for this
data set.
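For readers who wish to compute this kind of agreement measure themselves, the sketch below implements the chance-corrected Rand index in the form given by Hubert and Arabie (1985), commonly called the adjusted Rand index. The function name and the toy label vectors are illustrative; the sketch assumes at least two objects and returns 1.0 in the degenerate case where the correction term vanishes.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected Rand index between two partitions of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    n = len(labels_a)
    # Pairs of objects placed together in both partitions, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:  # degenerate case, e.g. both partitions have one cluster
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy partitions of four objects (hypothetical labels).
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabelled
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # crossed partition
```

The index equals 1 for identical partitions regardless of how clusters are numbered, and is close to 0 when agreement is no better than expected by chance, which is the sense in which values such as 0.154 and 0.170 indicate low correspondence.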



6. Conclusions 

We have presented an approach to cluster analysis via K-midranges clustering. It
is analogous to the K-means (MacQueen 1967), K-medians (Vinod 1969;
Rousseeuw 1987), and K-modes (Chaturvedi, Green, and Carroll 1996) clustering
procedures. The proposed approach can handle data of relatively large size (e.g.,
up to 5,000 consumers or other entities) and, hence, can be used fruitfully in
social and behavioral sciences. Though we presented an empirical illustration of
the K-midranges procedure, we feel more empirical research is needed to assess
the longer term potential of K-midranges clustering.
Abstract: In this paper we propose a new approach to the exploratory
analysis of multivariate clustered data. Our technique is based on a fast 
forward search algorithm which orders multivariate observations from 
those most in agreement with a specified clustering structure to those 
least in agreement with it. Simple graphical displays of a variety of 
statistics involved in the forward search lead to the identification of 
multiple outliers and influential observations in nonhierarchical cluster 
analysis, without being affected by masking and swamping problems. 
The suggested approach is applied to the convergent K-means method in 
two examples, both with real and simulated data. 

Keywords: cluster validity; forward search; K-means; masking;
multivariate outliers.



1. Introduction 

The aim of influence detection is to identify those data points that have 
a large impact on the partitions obtained from a clustering algorithm. 
This purpose is especially relevant to the areas of clustering validation 
and stability (see, e.g., Milligan, 1996, and Gordon, 1996). Recent 
contributions in the field of influence detection include Jolliffe et al.
(1995) and Cheng and Milligan (1996), which explicitly consider only
hierarchical methods. An extension to nonhierarchical clustering, for a
variable number of clusters, has been suggested by Cerioli (1997). These
techniques are based on the comparison of a reference partition,
computed on the complete data set through a specified clustering
algorithm, and a modified partition, obtained from a reduced data set
using the same algorithm. The reduced data set is produced by simply
deleting a single case from the complete one. However, it is well known
that single-case deletion diagnostics can suffer from the problems of
masking and swamping when multiple outliers are present in the data
(Barnett and Lewis, 1994).

At the opposite end of the spectrum from the "traditional plus
diagnostics" approach, Cuesta-Albertos, Gordaliza and Matrán (1997)
have proposed a high-breakdown procedure with the aim of robustifying
the K-means clustering method. Despite its mathematical elegance, a
disadvantage of this procedure is that it can prove difficult to apply in
practical situations with a large number of variables. Furthermore, very
precise estimates of cluster means may not be necessary when interest
lies in identifying multivariate outliers. The purpose of the present paper
is to propose a simple and fast technique which leads to the detection of
multiple outliers and influential observations in nonhierarchical
clustering, without being affected by masking and swamping problems.
Specifically, our approach is based on a forward search algorithm which
orders the observations from those most in agreement with a specified
clustering structure to those least in agreement with it.



2. The forward search for clustered data 

Forward search procedures for the identification of multiple outliers have 
been recently developed to deal with unstructured multivariate data 
(Hadi, 1992; Atkinson, 1994) and grouped data with known group 
structure (Atkinson and Riani, 1997). The present proposal extends this 
field of research to the case of multivariate clustered data with unknown 
clusters. The focus here is on the K-means method, although the
suggested technique can be conveniently applied to any nonhierarchical 
clustering algorithm. 

Suppose that the data matrix consists of an n × p set of multivariate
observations. Let $x_i$ be the observation corresponding to unit i. The
main steps of our algorithm can be summarized as follows.

Step 1. Given a fixed number of clusters, say K, find K "robust"
centroids. These are intended to be the centroids of clusters computed
from outlier-free observations. Our method for performing this step is
explained below, after the algorithm description. Take the unit which is
closest to the k-th robust centroid as the initial seed of cluster k (k = 1,
..., K).

Step 2. Define the initial centroid of cluster k as its initial seed. Denote
initial centroids by $c_1^{0}, \ldots, c_K^{0}$. Compute nearest-centroid distances as

$$d_i^{0} = \min_k \left[ d(x_i, c_k^{0}) \right], \quad i = 1, \ldots, n,$$

where d(x, y) is the (Euclidean) distance between p-dimensional vectors
x and y. Set s = 1.

Step 3. Put m = K + s. Arrange the observations in ascending order
according to $d_i^{s-1}$. Let $C_m$ be the "clean subset" of cardinality m, i.e. the
set of indexes of the m observations with the smallest $d_i^{s-1}$.

Step 4. Perform the K-means algorithm on observations indexed in $C_m$,
using $c_1^{s-1}, \ldots, c_K^{s-1}$ as initial centroids. Denote by $c_k^{s}$ the final centroid
of cluster k after performing the clustering algorithm (k = 1, ..., K).
Compute nearest-centroid distances as

$$d_i^{s} = \min_k \left[ d(x_i, c_k^{s}) \right], \quad i = 1, \ldots, n.$$

Step 5. Stop if s = n − K. Otherwise increase s by 1 and go to Step 3.

Our method for performing Step 1 of the forward search is to
compute a preliminary clustering with more than K groups, and take as
robust centroids those of the K largest clusters. This is because
(multiple) outliers, being unusual, should stand alone or cluster by
themselves at a large distance from the bulk of the data (see SAS, 1990).
However, it must be remarked that results from our algorithm are
largely unaffected by different choices of $c_1^{0}, \ldots, c_K^{0}$, if very precise
estimates of cluster means are not required. When outlier detection and
ordering of multivariate data are the focus, the crucial requirement is
that the set of initial centroids be outlier free. The preliminary use of
cluster analysis for defining a clean subset of multivariate observations
has been suggested also by Hadi and Simonoff (1993) in the context of
multiple regression models.
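Steps 1 to 5 can be sketched compactly in Python. This is a minimal illustration rather than the authors' implementation: the K-means routine is a plain Lloyd iteration, the robust centroids are taken as given input (the preliminary clustering that produces them is not shown), and all function names and the small data set are hypothetical.

```python
import math


def nearest(x, centroids):
    """Index of, and Euclidean distance to, the centroid nearest to point x."""
    dists = [math.dist(x, c) for c in centroids]
    k = dists.index(min(dists))
    return k, dists[k]


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations started from the given centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for x in points:
            groups[nearest(x, centroids)[0]].append(x)
        # An empty cluster keeps its previous centroid.
        updated = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def forward_search(X, robust_centroids):
    """Return the clean subsets C_m for m = K + 1, ..., n."""
    n, K = len(X), len(robust_centroids)
    # Step 1: the unit closest to each robust centroid is the cluster seed.
    seeds = [min(range(n), key=lambda i: math.dist(X[i], c))
             for c in robust_centroids]
    centroids = [list(X[i]) for i in seeds]               # Step 2
    d = [nearest(x, centroids)[1] for x in X]
    subsets = []
    for s in range(1, n - K + 1):                         # Step 5 stops at s = n - K
        m = K + s                                         # Step 3
        clean = sorted(range(n), key=lambda i: d[i])[:m]
        centroids = kmeans([X[i] for i in clean], centroids)   # Step 4
        d = [nearest(x, centroids)[1] for x in X]
        subsets.append(clean)
    return subsets


# Two tight groups plus one distant point (index 8), for illustration only.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11],
     [30, 30]]
subsets = forward_search(X, robust_centroids=[[0.4, 0.4], [10.6, 10.6]])
print(subsets[-1])
```

In this toy example the distant point stays outside every clean subset until the final step, which is the ordering property the search is designed to deliver.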

In most moves from cardinality m to m + 1, just one new
observation joins the clean subset. This provides an ordering of the data
according to the given K-cluster structure, with the observations furthest from
it joining the subset at the last stages of the forward search. Furthermore,
multivariate outliers and other influential observations can be detected
by simple graphical displays of a variety of measures of cluster cohesion
computed at successive iterations of Step 4 of our procedure. In the
examples that follow we consider the largest of the maximum within-cluster
distances



$D(s) = \max_k \left[ \max_{i,i'} d(x_i, x_{i'}) \right]$, (1)

where, at step s, units i and i' both belong to cluster k. The rationale
behind (1) is that, if the true structure of the data consists of K clusters
plus a number of atypical observations, D(s) stays small as long as $C_m$
remains outlier free. In addition, the plot of D(s) shows a dramatic
change when the first outlier joins the clean subset.
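
A minimal sketch of measure (1) is given below, assuming Euclidean distance and a cluster label for each unit currently in the clean subset. The function name and the toy data are hypothetical.

```python
import math
from itertools import combinations

def largest_within_cluster_distance(points, labels):
    """D = max over clusters of the largest pairwise distance inside the cluster."""
    groups = {}
    for x, lab in zip(points, labels):
        groups.setdefault(lab, []).append(x)
    diameters = [
        # A singleton cluster has no pairs, so its diameter defaults to 0.
        max((math.dist(a, b) for a, b in combinations(members, 2)), default=0.0)
        for members in groups.values()
    ]
    return max(diameters)

# Two tight clusters: D is small.
clean = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
d_clean = largest_within_cluster_distance(clean, [0, 0, 1, 1])

# An atypical unit is forced into cluster 0: D jumps.
d_contaminated = largest_within_cluster_distance(clean + [(0.0, 9.0)], [0, 0, 1, 1, 0])
```

The jump from `d_clean` to `d_contaminated` mimics the sharp change in the plot of D(s) when the first outlier joins the clean subset.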



3. Examples 

The effectiveness of the suggested method in detecting masked multiple 
outliers, and more generally in ordering multivariate clustered 
observations, is shown by means of two examples. In the first instance, 
we simulate a data set where a large number of outliers are known to 
exist. Secondly, we apply our algorithm to nonhierarchical clustering of 
European Regions. 

3.1. Simulated data 

In our first example we generate 90 realizations from an eight-variate 
normal distribution with constant variance, through a slightly modified 
version of the algorithm in Jolliffe et al. (1995). Observations are 
generated to fall into six distinct clusters of equal size, whose centroids 
all lie within the unit hypercube. Then we add 10 contaminated 
observations falling near (or slightly outside) the upper boundaries of the 
hypercube. 

A K-clusters partition is obtained through the convergent K-means 
method, with careful selection of seed points. In particular, a number of 
distance tests among potential seeds are performed before their selection, 
as in the FASTCLUS procedure of SAS (1990). Although contaminated 
values are clearly outliers with respect to the bulk of the data, they are 
not revealed by single-case deletion statistics, due to the masking effect, 
even if we correctly set K = 6. 

Then we apply the forward search algorithm described in section 2
for different values of K. Robust centroids are computed from a
preliminary clustering with 20 clusters. In all searches, contaminated
observations join the clean subset at the last 10 steps of the procedure.
However, the behaviour of (1) is markedly different according to what
value of K is actually assumed. Figure 1 shows the plot of D(s) against m
for K = 4, 5 and 6. For simplicity we only display steps such that m >
45. For K < 6, it is apparent that many well-behaved observations
produce high values of D(s) when included in $C_m$, due to misspecification
of the true clustering structure. On the contrary, the plot for K = 6
reveals a close agreement between the bulk of the data and the
underlying (true) structure, while contaminated values are clearly
declared as outliers. Plots of D(s) for K > 6 also produce similar
displays, as long as contaminated observations are not included in the set
of initial centroids.

Figure 1: Simulated data. Plot of D(s) against m for several values of K.

3.2. Clustering of European Regions 

In order to identify similar areas with respect to wealth and economic 
development, we classify the Regions of the European Union (EU) 
according to the following (standardized) variables: 

$X_1$ = activity rate;

$X_2$ = dependence rate (i.e. ratio between people not aged in the range
14-65 and people aged in that range);

$X_3$ = unemployment rate;

$X_4$ = youth unemployment rate;

$X_5$ = % of people employed in Industry;

$X_6$ = % of people employed in Trade and Services;

$X_7$ = index number of Gross National Product per inhabitant (measured
in Purchasing Power Standards).

Data are taken from the REGIO database (Eurostat, 1997) and refer to
1995, with the exception of $X_7$, which refers to 1993. European Regions
are defined according to the NUTS2 aggregation level, except for Sverige
and some areas in the U.K. and Germany, where a coarser aggregation
(NUTS1) is adopted due to lack of available information at the required
level. For the same reason Austria has been excluded from the analysis.
As a result, we have n = 160 data units in this application. The
corresponding data set is available from the author upon request.

As in the example of section 3.1, K-cluster partitions are obtained
through the convergent K-means method, for a number of values of K.
However, here we adopt an improved version of the FASTCLUS seed 
selection technique, in order to reduce the possible effect of the order of 
the observations on clustering results. The adopted seed selection 
procedure is described in Cerioli (1997). 

The forward search algorithm is applied to the computed partitions,
for $8 \le K \le 20$. Robust centroids are obtained from a preliminary
clustering with K' = 3K clusters, if K < 16; otherwise we set K' = 50.
Figure 2 shows the plot of D(s) against m for selected values of K in the
range of interest. For simplicity we only display steps such that m >
100. Although the behaviour of D(s) is not as clear-cut as in Figure 1,
there is some evidence that small values of K do not provide satisfactory
classifications, as a large number of Regions would then disagree with
the supposed group structure.

Furthermore, all the reported curves of D(s) show a sharp increase in 
the last stages of the forward search. This means that the last Regions in 
the ordering provided by the forward algorithm can be considered as 
potentially anomalous given the chosen number of clusters. Specifically, 
the Regions joining the clean subset at the last four steps are reported in 
Table 1 for selected values of K. Note that Voreio Aigaio (Greece), 
Brussels (Belgium) and Ceuta y Melilla (Spain) are always the last to 
come in, although in a different order, irrespective of the value of K.

Figure 2: Clustering of European Regions. Plot of D(s) against m for
selected values of K.




Table 1: Regions of the EU joining the clean subset at the last four steps
of the forward search for selected values of K.

           K = 8           K = 10          K = 12          K = 16     K = 18
m = 157    Hamburg         Sicilia         Sicilia         Corse      Corse
m = 158    Voreio Aigaio   Voreio Aigaio   Voreio Aigaio   Brussels   Ceuta y Melilla
m = 159

Abstract: Much of the classification literature ignores notions of probability. In
our view, this is due in part to a dominant tendency in the early days of 
computers for developing heuristic clustering algorithms and in part due to long 
traditions in classification outside the statistical/probabilistic orbit, of which 
biological taxonomy and book classification are primary examples. Statisticians 
have rightly stressed the role of probabilistic concepts in formulating 
classification problems and in interpreting classifications but we believe that they 
are wrong in suggesting, as they sometimes seem to, that other approaches are 
unsatisfactory. Probability has its proper place in classification but it is neither an 
essential nor always an appropriate tool. We discuss circumstances where
non-probabilistically-based classifications are fully justified.

Considerations influencing the differences between the two approaches include: 
1) Irrespective of whether things are to be assembled into classes (arranged 
hierarchically or not) or assigned to previously recognised classes, methodology 
depends on whether the things may be regarded as representing groups or as 
samples from groups; 2) Models are basic to the formulation of statistically 
based classifications, but they may also underpin nonprobabilistic classifications; 
overt models are not a characteristic of heuristic classification algorithms; 3) In 
principle, probabilistic models allow the significance and number of clusters 
justified by data to be assessed. In non-probabilistic classifications (probabilistic 
too), the eighteenth century concept of approximation offers a good basis for 
assessing the adequacy and stability of clusters. 

Keywords: Probabilistic Classification, Non-probabilistic Classification,
Classes, Groups, Assignment, Class Construction.

1. Introduction

The art of classifying knowledge has a long history reaching back into classical 
times. Book classification has long been essential for librarians and scholars and 
in the eighteenth century Linnaeus standardised the taxonomy of life-forms. The 
organisation of government and of large business enterprises is essential for 
efficiency. Preliminary classification is essential for, and may be designed to 
simplify, information retrieval. Just by naming things we recognise the existence 
of the class of things that bear that name. When the number of associated named 
things grows large, some form of classifying them into manageable, named, 
groups is essential in order to organise and describe what would otherwise be 
chaotic. Thus, the notion of classification is a fundamental human activity. 

In the above examples, probability and models have little if any role to play. 
Although evolutionary concepts, which offer the possibility of modelling, are of 
current interest, they played no part in Linnaeus' foundations of taxonomy. 
Indeed, one of the reasons why biological taxonomy is so fascinating is because 
of the interplay between the cladists who seek classifications based on evolution 
and the pheneticists who are content with utilitarian classifications. The two 
approaches are not independent because species which are close in evolution will 
tend to share many common features and hence are likely also to be phenetically 
close. This argument which dominated the early days of numerical classification 
is of importance to biologists but has little relevance to classifications in most 
other fields of application, except, perhaps, to highlight what should be obvious, 
that there is no uniquely correct classification - it depends why one is classifying. 

When classifying books or major species-groups, notions of variability are of 
little relevance. Statistics is concerned with handling variability. Discriminant 
analysis and the multivariate mixture problem are two classical classification 
problems handled by statistical methods. Both are concerned with two or more 
populations with several variables whose values are characterised by probability 
distributions within the populations. To force other situations into a framework 
based on populations modelled by probability distributions can be misguided. 

Below, we first focus on how the status of the things being classified interacts 
with admissible criteria for defining classes and then we identify a few problems 
where nonprobabilistic classification is appropriate. Finally, we comment on 
how in non-probabilistic classifications the concepts of statistical inference may 
be replaced, and perhaps bettered, by the older concept of approximation.

2. Types of variable and the things being classified

Three main ingredients of all classification problems are (i) the things being 
classified, (ii) the variables used to describe these things and (iii) the purposes of 
the proposed classification. In this section we mainly discuss (i) ending with a 
brief reference to (ii); some purposes of classification are discussed in section 3. 
Customarily, data for classification are presented in a matrix X with N rows 
pertaining to the objects and p columns pertaining to the variables; we shall see 
that this can be a simplistic view of the structure of the data. 

Table 1: Types of classification problem depending on whether (i) the problem is
probabilistic or non-probabilistic, (ii) is for assignment to classes or
construction of classes, or (iii) is concerned with structured or unstructured
objects. (Simplification of a similar table in Gower, 1998)

                     Objects         Assignment         Construction

Non-probabilistic    Structured      Matching           Maximal predictive classes
                                     Diagnostic keys    Cluster analysis: k-groups,
                                                        hierarchic, other.

Probabilistic        Unstructured                       Mixture problems
                     Structured      Discrimination     Undeveloped



Table 1 is a three-way table giving a breakdown of some well-known
classification problems. As well as distinguishing (i) between probabilistic and
non-probabilistic classifications and (ii) between assignment/identification and
class-construction problems, Table 1 distinguishes (iii) between the objects being
classified as structured or unstructured. We take the meanings of (i) and (ii) to
need no further explanation; what we mean by (iii) is best understood by
considering the usual between and within groups sample-structure that is
familiar in simple analysis of variance and in canonical variate analysis. Thus,
the N rows of our data matrix X refer to objects drawn from g groups, the kth of
which has $n_k$ members, and $N = \sum_{k=1}^{g} n_k$. The g groups may have further a priori
structure but, at least for the present, we shall assume that this is not so.

We begin by considering two important special cases of this general type of 
data-structure. Type A: g = 1 so that $N = n_1$, and Type B: $n_k = 1$ for k = 1, 2, ..., g.
Type A is the classical multivariate data-matrix of statistical multivariate
analysis. Typically, its rows represent a random sample of size $N = n_1$. The
samples have no a priori structure, so are said to be unstructured. Note that it is
the data that are unstructured, but nevertheless they might be modelled by a
highly structured model. Indeed, that assumption is the starting point of the 
multivariate mixture problem which seeks a representation of the underlying 
probability distribution as a mixture of g more simple distributions. After 
analysis, the previously unstructured samples can be assigned to the g newly-found
distribution-groups and then X becomes structured. With Type B we have
g groups at the outset where each group is represented by one object, or row, of 
X. This is quite different from the familiar statistical data-matrix and would be an 
inadequate representation of reality when variability within groups is substantial. 
However, there are many classification problems where within-group variability 
may be ignored. For example, the kth group may have a unique member (there 
is only one Rome) or we may be able to choose variables which do not vary 
within groups (all cats have claws) but do between groups. The latter is a very 
common situation and one may note that those interested in classifying groups 
will seek to describe the groups by variables which are constant; indeed, they 
would be foolish not to do this when it is a possibility. When such variables are 
unavailable, the only remaining possibility is to base classifications on variables 
which do vary within groups, in which case probabilistic methods become 
relevant and the choice $n_k = 1$ becomes untenable; then replication of samples 
within groups is essential to capture the within-group variability. Within groups 
variation is a familiar component of the assignment problem of classical 
discriminant analysis but at the end of section 4 we sketch how it can also be 
important when constructing classes. 

With groups, even when $n_k = 1$, X is said to be structured. Normally, the groups 
will bear names indicating an initial classification. Botanists recognise species 
like daisies and dandelions, linguists recognise languages, geographers recognise 
countries, librarians recognise book-titles and so on. This initial classification 
does not inhibit a desire for further classification; rather it encourages 
classification to organise what can be very large bodies of information. 

To recapitulate, the objects to be classified are said to be structured when they fall 
into named groups, which implies that a preliminary classification is recognised. 
Otherwise they are said to be unstructured, which usually means that the rows of 
X are undifferentiated as with a random sample assumed to be drawn from 
some notional population or mixture of populations. In Table 1 all
non-probabilistic classification problems are concerned with objects of Type B. In
probabilistic classification problems we meet with both structured (e.g. canonical
variate analysis) and unstructured (e.g. multivariate mixtures) sets of objects.

With replication within groups, whether or not the $n_k$ replicates are chosen by
some random process is important. It is not obvious that simple random 
samples are necessarily a desirable basis for forming classifications. For 
example, the number of speakers of a language is unlikely to be relevant for
classifying languages, thus excluding simple random sampling. Indeed, it is the
distribution of variables within the groups which is important, not the relative
frequencies of speakers; each language might be represented by a single set of
characteristics as in taxonomy where within-group variation is often handled by 
representing each group by a single invented object with typical values; this is 
acceptable when within group variation is negligible. Type B data are an extreme 
form of non-random sampling but are often the proper object structure. 

We have considered the most simple between-within structure where the g 
groups represent the totality of objects to be classified. This does not preclude the 
possibility of additional groups being added at a later stage. Also the groups 
might be regarded as a random sample of some larger set, and then probabilistic 
methods might have a role to play; to us this seems an artificial set-up. The 
groups might have an elaborate a priori imposed structure of the crossed and 
nested kinds, indicating an equally elaborate a priori classification. Then it seems 
unlikely that one would wish to use this classification as the starting point for 
further classification. Rather, one would begin again, subsuming the complex 
structure into a simple group structure and comparing any new classification 
with the old one. No further account is taken of these possibilities here but they 
point the way to some areas of future research in classification theory. 

The above has said little about the types of variable used in classification. 
Variables may be numerical or categorical. If numerical, they may be continuous 
on ratio or interval scales, a distinction which rarely affects classification. 
Categorical variables may be nominal, ordinal or dichotomous, a special binary 
categorical variable with one category merely not the other; all these types 
contain different kinds of information which may be exploited when classifying 
things. Numerical variables are less important for non-probabilistic classification 
than are categorical variables. This is because numerical variables are likely to 
vary within groups, so when non-varying categorical variables can be found they
are to be preferred. Just as objects may be structured, so may variables. Structure
in variables has been long-recognised in survey design and Gower (1971) took it
into account when designing a general similarity coefficient in which primary
variables could be associated with sets of secondary variables, in turn associated
with tertiary variables, and so on.

3. Non-probabilistic classification problems

Table 1 lists some non-probabilistic classification problems, both for assignment
to classes and for the construction of classes; the starting point is data of Type B.

Identification keys constructed by ad hoc methods have been known to botanists 
for several centuries. To use a key one has a single object which has to be 
identified, that is it has to be named. The most simple thing to do is to compare 
the object with each row of X until a match is found, when identification is 
achieved. This is inefficient because an enormous number of comparisons may 
have to be made, and it may be impracticable because variables included in X 
may be unavailable (e.g. a flowering plant may give no information on the 
characteristics of its seeds). Diagnostic keys overcome these difficulties. A key 
is essentially a tree with one binary variable at each node. The value of the binary 
variable for the object to be identified determines which of the two possible 
branches of the tree one traverses to reach the next node. In this way one 
answers a series of questions until one reaches an end-point of the tree, where 
the identification is given. Many interesting problems in constructing keys have 
been reviewed by Payne and Preece (1980). We may require the tree with the fewest
nodes, or the one using the fewest distinct binary variables, or the one with the
minimum average number of steps to achieve identification. Costs may be associated with
ascertaining the values of the binary variables, in which case we may require the 
key that is cheapest to use. Probability concepts are irrelevant for these 
interesting problems but may enter if we recognise that the objects have different 
frequencies of occurrence, so affecting average numbers of steps to identification 
and average costs. However, any probabilistic distribution associated with the 
binary variables themselves is not required nor is it material. 
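
The structure of a diagnostic key can be sketched as a small binary tree. The example below is purely illustrative: the class `KeyNode`, the function `identify`, and the botanical characters and names in the toy key are all hypothetical and do not come from any published key.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KeyNode:
    """A node of a diagnostic key: either an end-point (name) or a binary test."""
    name: Optional[str] = None          # set only at end-points of the tree
    variable: Optional[str] = None      # binary variable examined at this node
    if_false: Optional["KeyNode"] = None
    if_true: Optional["KeyNode"] = None

def identify(node, observed):
    """Follow the key for one object; return its name and the number of steps used."""
    steps = 0
    while node.name is None:
        node = node.if_true if observed[node.variable] else node.if_false
        steps += 1
    return node.name, steps

# Toy key with invented characters.
key = KeyNode(
    variable="flowers_yellow",
    if_true=KeyNode(
        variable="leaves_toothed",
        if_true=KeyNode(name="dandelion"),
        if_false=KeyNode(name="buttercup"),
    ),
    if_false=KeyNode(name="daisy"),
)

first = identify(key, {"flowers_yellow": True, "leaves_toothed": True})
second = identify(key, {"flowers_yellow": False})
```

Note that the second object is identified without its value of `leaves_toothed` ever being needed, which is how a key copes with unavailable variables. The step counts returned are the quantities one would average when comparing keys.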

When it comes to constructing classifications with the same form of X (i.e. with 
binary variables) we can classify the g objects into k<g groups in such a way that 
on being told that an object belongs to one of these groups, more correct 
statements about the likely value of its binary variables can be made than for any 
other classification. This is maximal predictive classification (Gower, 1975) 
which models the dictum of a distinguished botanical taxonomist, Gilmour 
(1937), that a system of classification is the more natural the more propositions 
there are that can be made regarding its constituent classes. No probability 
distribution is associated with the binary variables, although again the relative 
frequencies of the objects may be accommodated. 
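
One way to make the criterion concrete is to count, for a given partition, how many binary values are correctly predicted when each class is summarised by its modal value on every variable. The sketch below assumes this reading of "correct statements"; the function name `correct_predictions` and the toy data are hypothetical.

```python
def correct_predictions(rows, labels):
    """Count binary values predicted correctly by the modal value of each class."""
    classes = {}
    for row, lab in zip(rows, labels):
        classes.setdefault(lab, []).append(row)
    total = 0
    for members in classes.values():
        for column in zip(*members):
            ones = sum(column)
            # Predicting the more frequent value is right for max(ones, zeros) objects.
            total += max(ones, len(column) - ones)
    return total

# Four objects described by three binary variables.
rows = [(1, 1, 0), (1, 1, 1), (0, 0, 1), (0, 0, 0)]
good = correct_predictions(rows, [0, 0, 1, 1])   # similar objects grouped together
poor = correct_predictions(rows, [0, 1, 0, 1])   # dissimilar objects grouped together
```

Under this reading, a maximal predictive classification into k classes is a partition that maximises the count; here the first partition predicts 10 of the 12 values and the second only 6.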

We may apply the k-means algorithm to numerical Type B data, thus forming 
homogeneous groups among the N objects without any appeal to probability. 
This is an interesting example, because if X is regarded as of Type A then the 
algorithm finds the maximum likelihood solution to the mixture problem 
modelled as a combination of multinormal populations. Thus, the same 
algorithm may be used to compute two different classification problems, one 
probabilistic and the other not. Of course, different mixture models have 
different maximum likelihood solutions while the k-means algorithm is only 
valid for multinormal mixtures. 

Many, mostly heuristic, algorithms applied to Type B data give hierarchical
classifications of the g objects. Heuristic algorithms are acceptable when there is
no practicable way of optimising an objective criterion (e.g. NP complete 
problems). Ultrametrics and additive trees give objective criteria for fitting trees 
by least squares, thus providing well-defined models. The use of least-squares 
does not necessarily imply an appeal to probability. One may note the eighteenth 
century work of mathematicians like Legendre and Laguerre who used $L_1$, $L_2$, or
$L_\infty$ norms to approximate complicated functions (e.g. Bessel functions) by
polynomials. Polynomials give an acceptable approximation to the function - 
probability is irrelevant. Analogously, we may regard nonprobabilistic 
classifications as giving similar approximations to X where the goodness of fit 
may be used to assess the adequacy of the tree approximation or of the class- 
predictors fitted to maximise prediction. Such measures of approximation are at 
least as useful as significance tests associated with probabilistic models. 
Remember, significant is not synonymous with important and not significant is 
not synonymous with unimportant, a point we return to below. 

Other nonprobabilistic objectives for classification concern mixtures of 
hierarchical and non hierarchical organisation. We may wish to classify the g 
groups into k<g classes arranged hierarchically with each class containing one or 
more undifferentiated members. Indeed, it seems to us that this is more 
frequently required than full hierarchical classification but it rarely gets 
mentioned, except when pointing out that branches of a full tree may be 
amalgamated, perhaps governed by specifying some threshold level in a 
dendrogram. Again with Type A data we could formulate a variant of the 
mixture problem in which the k classes are to be arranged hierarchically. Gower 
(1975) gives a method for constraining maximal predictive classes to have 
hierarchical organisation which immediately extends to other classification 
criteria; unfortunately the computational problems are formidable.

4. Conclusion

We hope we have shown that nonprobabilistic classification criteria are often 
relevant and lead to interesting and useful methods. Although we distinguish 
between assignment to and construction of classes, a perfectly good criterion for 
constructing classes is so that they can be assigned to optimally; nonprobabilistic 
maximal predictive classes and maximum likelihood mixture classes both have 
optimal assignment properties. We have highlighted two extreme forms of X, 
Type B (g=N) for which nonprobabilistic methods are often applicable and Type 
A (g=1) where probabilistic methods are often applicable. That computers 
cannot distinguish between Types A and B has encouraged some misuse of 
classification software, a situation made the more confusing when the same 
algorithm may be used on either form of X to solve different problems. 

Important as it is, not all useful classificatory information need be probabilistic. 
Suppose we wish to classify g normal populations into k<g groups. Following 
discrimination ideas, we could seek boundaries which minimise the overlap 
between the k groups, thus minimising future errors of misclassification. With 
limiting point-densities any set of boundaries give no overlap and no possibility 
of misclassification. Nevertheless, it remains reasonable to require a grouping of 
the populations into k classes, possibly nested, by grouping together pairs of 
point-populations that are closer than others, as judged by distances based on the 
nonprobabilistic information contained in the values of the variables. When the 
point densities expand to conventional distributions, it seems that a combination 
of probabilistic and nonprobabilistic information should be used for constructing 
classifications; assignment to these classes might be entirely probabilistic.

Abstract: A problem common to all clustering techniques is the difficulty
of deciding the number of clusters present in the data. The aim of this paper
is to assess the performance of the best stopping rules from Milligan and
Cooper’s (1985) study, on specific artificial data sets containing a particular
cluster structure. To provide a variety of solutions the data sets are analysed
by four clustering procedures. We also compare these results with those
obtained by three methods based on the hypervolume clustering criterion.

Keywords: clustering, stopping rule, number of clusters, hypervolume criterion.

1. Introduction 

Most clustering algorithms can partition a data set into any specified number
of clusters even if the data set contains no structure. So one of the most
important problems when validating the results of a cluster analysis is: how
many clusters are in the data? Several studies have compared
procedures for determining the number of clusters. For example, Milligan and
Cooper (1985) conducted a Monte Carlo evaluation of thirty indices for
determining the number of clusters. Hardy (1994) compared three methods based
on the hypervolume clustering criterion with four other methods available in
the Clustan (1978) software. More recently, Gordon (1997) modified the five
stopping rules whose performance was best in Milligan and Cooper’s (1985)
study in order to detect when several different, widely-separated values of c,
the number of clusters, would be appropriate, that is, when a structure is
detectable at several different scales.

This paper addresses the problem of assessing the performance of the best
procedures from Milligan and Cooper’s (1985) study, on specific artificial
two-dimensional data sets showing a particular cluster structure considered 
neither in Milligan and Cooper’s (1985) study nor in Gordon’s (1997) paper. 
We also compare these results with those obtained by three methods based on 
the hypervolume clustering criterion.

2. The hypervolume clustering criterion

We assume a clustering model in which the observed points are a realization of a
Poisson process in a set D of $\mathbb{R}^m$, where D is the union of k disjoint convex
domains $D_1, D_2, \ldots, D_k$; $C_i \subset \{x_1, x_2, \ldots, x_n\}$ is the subset of observations
belonging to $D_i$ ($1 \le i \le k$). The problem is to estimate the unknown domains
$D_i$ in which the points are distributed. The maximum likelihood estimation of
the k subsets $D_1, D_2, \ldots, D_k$ is constituted by the k subgroups $C_i$ of points
such that the sum of the Lebesgue measures of their disjoint convex hulls
$H(C_i)$ is minimum ($1 \le i \le k$) (Hardy and Rasson (1982); Hardy (1983)).
The hypervolume clustering criterion is thus defined as

$W(P, k) = \sum_{i=1}^{k} m(H(C_i))$

where P is a partition of the observed points into k clusters, $H(C_i)$ is the
convex hull of the points belonging to $C_i$ and $m(H(C_i))$ is the m-dimensional
Lebesgue measure of that convex hull.
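
For planar data (m = 2) the criterion reduces to a sum of convex-hull areas, which the following sketch evaluates for a given partition. It is only an illustration of the definition, not of the optimisation over partitions; the hull routine is the standard monotone-chain construction, and all function names and data are assumptions.

```python
def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula; returns 0 for fewer than three vertices."""
    n = len(vertices)
    twice = sum(vertices[i][0] * vertices[(i + 1) % n][1]
                - vertices[(i + 1) % n][0] * vertices[i][1] for i in range(n))
    return abs(twice) / 2.0

def hypervolume_criterion(clusters):
    """W(P, k): sum of the areas of the convex hulls of the k clusters."""
    return sum(polygon_area(convex_hull(c)) for c in clusters)

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
triangle = [(3.0, 0.0), (5.0, 0.0), (3.0, 2.0)]
w_two = hypervolume_criterion([square, triangle])   # two separate clusters
w_one = hypervolume_criterion([square + triangle])  # everything in one cluster
```

Splitting the points into their two natural groups gives a much smaller total hull area than treating them as one cluster, which is the behaviour the criterion rewards.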

3. Methods to determine the number of clusters 

The first six indices - those whose performance was best in Milligan and
Cooper’s (1985) study - are the Calinski and Harabasz (1974) method
(M1), the J index (Duda and Hart (1973)) (M2), the C index (Hubert and
Levin (1975)) (M3), the $\gamma$ index (Goodman and Kruskal (1954)) (M4), the
Beale (1969) test (M5), and the Cubic Clustering Criterion (M6) (Sarle, 1983).

The last three methods are based on the hypervolume criterion. 

3.1 A classical geometric method (M7) 

This well-known method consists in plotting a criterion value W against k, the
number of clusters. With every increase in k there will be a decrease in W.
A discontinuity in slope should correspond to the true number of “natural” 
clusters. W is here the hypervolume criterion. 

3.2 Method based on the estimation of a convex set (M8)

If we consider a realization of a homogeneous planar Poisson process of
unknown intensity within a compact convex set D, the best estimate of D is
given by $D' = g(H(D)) + c \cdot s(H(D))$, where H(D) is the convex hull of the
points belonging to D, g(H(D)) is the centroid of H(D) and
$s(H(D)) = H(D) - g(H(D))$. So D' is a dilation of the convex hull about its centroid
(Ripley and Rasson (1977)). The realization of a Poisson process within the union
D of k subsets $D_1, D_2, \ldots, D_k$ can be considered as the realization of k
Poisson processes of the same intensity within the k subsets $D_1, D_2, \ldots, D_k$. We
apply that result to the hypervolume clustering problem. For k = 2, 3, ..., we
consider the dilated convex hulls of the clusters of the best partition of the set
of objects into k clusters. If $k_0$ is the first value of k for which at least one
pair of dilated convex hulls do intersect, we then consider $k_0 - 1$ as the optimal
number of clusters (Hardy (1994)).
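
The dilation step can be sketched for a planar hull as follows. The value of the constant c is not given in the text, so the value used here is a placeholder chosen only for illustration; the function names are likewise hypothetical, and the test for intersection between dilated hulls is not shown.

```python
def polygon_centroid(vertices):
    """Area centroid of a simple polygon with non-zero area, vertices in order."""
    twice_area = 0.0
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)

def dilate(vertices, c):
    """Map each hull vertex v to g + c * (v - g), where g is the hull centroid."""
    gx, gy = polygon_centroid(vertices)
    return [(gx + c * (x - gx), gy + c * (y - gy)) for x, y in vertices]

hull = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
dilated = dilate(hull, c=1.5)  # c = 1.5 is an arbitrary illustrative value
```

With c > 1 the hull grows about its centroid, so neighbouring clusters whose raw hulls are disjoint may have dilated hulls that overlap.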

3.3 A likelihood ratio test (M9) 

Let $x_1, x_2, \ldots, x_n$ be a random sample from a Poisson process on $k$
disjoint convex sets $D_1, D_2, \ldots, D_k$. For a given integer $k \ge 2$,
we test whether a subdivision into $k$ clusters is significantly better than a
subdivision into $k - 1$ clusters, i.e. the hypothesis $H_0 : t = k$ against
the alternative $H_1 : t = k - 1$, where $t$ represents the number of clusters.

The statistic of the test is given by (Hardy, 1994): $S(x) =$

We have $S(x) \in [0, 1]$. Thus we will reject $H_0$ iff $S(x) > C$, where $C$
is a constant. Unfortunately we do not know the $H_0$ distribution of the
statistic $S$. Nevertheless in practice, we can use the following rule: reject
$H_0$ if $S$ takes large values, i.e. if $S$ is close to 1. So we will apply
the test in a sequential way: if $k_0$ is the first value of $k$ for which we
reject $H_0$, we shall consider $k_0 - 1$ as
the appropriate number of natural clusters.
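
The sequential rule can be sketched as follows. The observed values of S are
taken as given, since the sketch does not compute the statistic itself. The
threshold is an assumption: the null distribution of S is unknown, so any
cut-off for "close to 1" is a judgment call. The function name and the sample
values are hypothetical.

```python
def sequential_number_of_clusters(stats, threshold):
    """Apply the sequential rule to observed values of the statistic S.

    stats maps k (k >= 2) to the value of S for H0: t = k vs H1: t = k - 1.
    Returns k0 - 1, where k0 is the first k at which H0 is rejected
    (S above the threshold), or None if H0 is never rejected.
    """
    for k in sorted(stats):
        if stats[k] > threshold:
            return k - 1
    return None


# Hypothetical values of S for k = 2, ..., 5 and a hypothetical cut-off.
observed = {2: 0.30, 3: 0.25, 4: 0.95, 5: 0.97}
print(sequential_number_of_clusters(observed, threshold=0.9))
```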

4. Results 

In order to assess the six best stopping rules from Milligan and Cooper's
study, we have considered four well-known hierarchical clustering procedures
(nearest neighbour, furthest neighbour, group average and Ward) and six
artificial data sets containing a known structure (one single homogeneous
group, two or three elongated clusters, linearly nonseparable clusters,
unequal-size hyperspherical-shaped clusters, nonconvex clusters). We have then
applied the six indices to the results obtained by applying the clustering
methods to the test data sets.

4.1 First set of data: absence of any group structure 

Here we have simulated a Poisson process in a single rectangular set in the
plane, so the 150 points are independently and uniformly distributed in this
set. The performance of the six indices is tabulated in Table 1.





                      M1   M2       M3   M4   M5   M6
nearest neighbour     4    1        1    1    1    1
furthest neighbour    4    1 or 3   1    1    1    1
group average         4    1        1    1    1    1
Ward                  4    1 or 3   1    1    1    1



Table 1: One homogeneous group.

The first column of each table lists the names of the clustering procedures.
The other columns show the results given by the six best stopping rules from
Milligan and Cooper's study.

Methods M3, M4, M5 and M6 detect one cluster. Duda and Hart's rule (M2)
requires the computation of a critical value. One of the parameters used in
the computation is z, a standard normal score. Milligan and Cooper chose z to
be 3.2 because, for the examples considered in their paper, this value of z
produces the best recovery. In his study Gordon specifies z to be 4.0. If z is
equal to 4, M2 gives in this example the correct number of clusters. If we
choose 3.2 for z, M2 detects three clusters in two cases (see Table 1).
Caliński and Harabasz's method (M1) does not recover the true structure in
this example.

4.2 Second set of data: two elongated clusters 

In this example, the natural structure consists of two elongated parallel clus- 
ters. The results appear in Table 2. 







                           M1   M2   M3   M4   M5   M6
nearest neighbour     +    6    1    1    1    1    1
furthest neighbour    -    2    2    1    1    1    1
group average         -    2    2    1    1    1    1
Ward                  -    2    2    1    1    1    1



Table 2: Two elongated clusters. 



A “+” in the second column of Table 2 indicates that if we apply a clustering
method in order to get two clusters, we obtain the two “natural” elongated
clusters; dashed entries correspond to situations where the classification
obtained is not the expected one. Only the nearest neighbour method reproduces
the “natural” classification of the data, and in that case, the six best rules
from Milligan and Cooper's study are ineffective in predicting the correct
number of clusters. Milligan and Cooper noted that the two-cluster case was
the most difficult structure for the stopping rules to detect. Note that some
rules determine the correct number of clusters for a partition into two
clusters that is not the natural one.

4.3 Third set of data: three elongated clusters 

In this example, artificial data were generated to contain three elongated
parallel clusters. The results obtained are given in Table 3.

                           M1   M2   M3   M4   M5   M6
nearest neighbour     +    2    1    1    1    1    1
furthest neighbour    -    2    2    1    1    1    1
group average         -    2    2    1    1    1    1
Ward                  -    2    2    1    1    1    1



Table 3: Three elongated parallel clusters. 



The results here are similar to those of the second data set. Only the
nearest neighbour method reproduces the “natural” classification when we fix
the number of clusters to three. The furthest neighbour, group average and
Ward procedures tend to produce hyperspherical-shaped clusters. So it is not
surprising to see that the results obtained here are similar to those of the
previous data set: Caliński and Harabasz's method and the J index detect two
clusters in almost all cases in both examples; the other rules detect only one
cluster.

4.4 Fourth set of data: three elongated nonseparable clusters 

In this example, we have chosen three elongated clusters such that none is
linearly separable from the two others. Many clustering methods fail to detect
the natural partition into three groups. The results appear in Table 4.

                           M1   M2       M3   M4   M5   M6
nearest neighbour     +    3    1        1    1    1    1
furthest neighbour         1    1 or 3   1    1    1    1
group average         -    1    1        1    1    1    1
Ward                  -    1    1 or 3   1    1    1    1



Table 4: Three elongated nonseparable clusters. 



The nearest neighbour method is efficient in recovering the underlying
structure into three clusters. Caliński and Harabasz's method (M1) applied
to the results of the nearest neighbour procedure gives the correct number of
clusters. If we choose 3.2 for z, Duda and Hart's method (M2) detects
three clusters in two cases. If z is equal to 4, M2 gives only one cluster. The
other rules detect only one cluster.

4.5 Fifth set of data: unequal-size hyperspherical-shaped clusters 

This set of data consists of two groups; the first one contains eighty points,
and the second one twelve points. The distance between the two groups is
larger than the diameter of each of the groups. So the two natural clusters are
well separated. The results obtained are tabulated in Table 5.

                           M1   M2   M3   M4   M5   M6
nearest neighbour          2    2    2    2         2
furthest neighbour         2    2    2    2    1    3
group average         +    2    2    2    2    1    6
Ward                       2    2    2    2    1    6



Table 5: Unequal-size hyperspherical-shaped clusters. 



The Beale test (M5) and the cubic clustering criterion (M6) perform poorly;
they fail to recognize the true structure of the data, except when M6 is
applied to the results given by the nearest neighbour method. The first four
rules predict the correct number of clusters.

4.6 Sixth set of data: nonconvex linearly nonseparable clusters 

This set of data contains two clusters. Both are nonconvex. They are not
linearly separable. The results obtained are given in Table 6.

                           M1   M2   M3   M4   M5   M6
group average                                       1
Ward                  -    1    1    1    1    1    1



Table 6: Nonconvex linearly nonseparable clusters. 



Here also, only the nearest neighbour method retrieves the natural structure of
the data when we fix the number of clusters to two. The six best methods from
Milligan and Cooper's study are unable to determine the correct number
of clusters.

Results associated with the hypervolume criterion 

We have applied the hypervolume clustering procedure to the same data sets and
recorded the number of clusters given by the three methods M7, M8 and
M9, based on the hypervolume criterion.

In the first five examples the hypervolume method retrieves the natural
structure of the data when we fix the number of clusters to its correct value.
The hypervolume method is based on a hypothesis of convexity for the clusters.
In the sixth example, the clusters are nonconvex and linearly nonseparable. So
in that case it is impossible for the hypervolume method to detect the two
natural nonconvex clusters.

The three methods for the determination of the number of clusters based on
the hypervolume criterion (M7, M8, M9) always correctly estimate the number
of clusters in the first five examples. In the sixth data set the three methods
detect only one group.

5. Conclusions 

In Milligan and Cooper's study, the clusters were internally cohesive and
well separated in the variable space. So the aim of this paper was first to
investigate the performance of the six best procedures from Milligan and
Cooper's study on specific data structures not considered by these authors.
They themselves pointed out that the results they obtained are likely
to be data dependent, and that it would not be surprising if the ordering
of the indices changed when different data structures were used.

From the examples considered in this paper, we can conclude that the six
best methods of Milligan and Cooper's investigation are usually unable to
detect elongated clusters, linearly nonseparable clusters and nonconvex
clusters. That is not very surprising because some of these rules, being based
on between-groups and within-groups sums of squares, are better able to
detect hyperspherical-shaped clusters. In addition, the Caliński and Harabasz
method (M1), and the J index if z is equal to 3.2, have difficulty detecting
the presence of one single homogeneous group. The Beale test (M5) and the
Cubic Clustering Criterion (M6) perform very poorly in the presence of unequal
cluster sizes.

Concerning the methods based on the hypervolume criterion, they performed 
at a competitive rate on the examples considered here, except in the case of 
nonconvex linearly nonseparable clusters. 

Nevertheless we can recommend the collective use of the procedures considered 
here, being aware of the fact that the data may contain a particular cluster 
structure. 

Overall, the problem of estimating the number of clusters in multidimensional
data remains a difficult and challenging problem. No completely satisfactory
solution is available. Maybe the question is incapable of any formal or
complete solution simply because there is nowadays no general agreement on a
universally acceptable definition of the term cluster.

Abstract: Fixed Point Cluster Analysis (FPCA) is introduced in this paper.
FPCA is a new method for non-hierarchical cluster analysis. It is related to
outlier identification. Its aim is to find groups of points generated by a
common stochastic model without assuming a global model for the whole dataset.
FPCA allows for points not belonging to any cluster, for the existence of
clusters with a different shape, and for overlapping clusters. FPCA is
applied to the clustering of p-dimensional metrical data, 0-1-vectors, and
linear regression data.

Keywords: Stochastical clustering, overlapping clusters, mixture model, out- 
lier identification, linear regression 



1. The cluster concept of Fixed Point Clusters

Sometimes, a dataset does not consist of a partition into some “natural
clusters”, but nevertheless there seem to be clusters, perhaps of different
types. The following dataset is taken from Hand et al. (1994), p. 58, and
gives the rates of child mortality (y-axis) and adult female literacy (x-axis)
of 29 countries.

Figure 1: UNICEF-data 




There seems to be some “grouping” in the data. A linear regression model
could be adequate for all or a large part of the filled points, but hardly for
considerable parts of the rest.

The aim of Fixed Point Clustering is to find “natural clusters” in a stochastic
sense. For a given dataset, a “cluster” in a stochastic sense is a subset whose
points are described adequately by a common distribution. This distribution
comes from a parametric family of “cluster reference distributions” which
model the form of the clusters of interest. For example, it could be Normal or
other suitable unimodal distributions, while bimodal distributions
heuristically mostly generate two clusters. Frequently, such stochastic
cluster analysis problems are treated by means of a mixture model

$P = \sum_{i=1}^{s} \epsilon_i P_i, \quad \sum_{i=1}^{s} \epsilon_i = 1, \quad \epsilon_i > 0,\ i = 1, \ldots, s,$   (1)

where the $P_i$, $i = 1, \ldots, s$, are cluster reference distributions with
different parameters and the $\epsilon_i$, $i = 1, \ldots, s$, are the
proportions of the $s$ clusters. While the estimation of $s$ is difficult, the
parameters of the $P_i$ can often be estimated consistently by Maximum
Likelihood estimators if $s$ is known.

In contrast to that, FPCA is a local cluster concept, i.e. the cluster
property of a data subset depends only on the particular set and its relation to
the whole dataset, but not on a global model. FPCA is based on “contamination
models” of the form

$P = (1 - \epsilon) P_0 + \epsilon P^*, \quad 0 < \epsilon < 1,$   (2)

where $P_0$ is a cluster reference distribution. $P^*$ can be arbitrary. In
particular, it can be a mixture of other cluster reference distributions so
that this model includes mixture distributions of the form (1). Additionally,
$P^*$ has to be “well separated” from $P_0$ in order to guarantee that the
points generated by $P_0$ form a cluster with respect to the whole dataset. An
ideal FPC should consist of the points generated by $P_0$. Contamination
models are often used in robust statistics and outlier identification, and a
cluster could be seen as a data subset which belongs together, i.e. it
contains no outliers, and which is separated from the remaining data, i.e. all
other data points are outliers w.r.t. the cluster. FPCA provides a
formalization of this idea.

This concept can cope with more complex data situations: 

• If a dataset contains outliers, they need not be included in any cluster.

• Sometimes a dataset contains clusters of various shapes. For example, 
there are growth curves in biology, where linear growth is followed by 
logarithmic growth. 

• Clusters can overlap. 

• No assumption on the number of clusters is needed.



2. Definition of fixed point clusters

An FPC is a subset of the data which does not contain any outlier. All
other points of the dataset have to be outliers with respect to the FPC.

More formally: Let $M$ be the space of the data points and
$Z_0 = (z_1, \ldots, z_n)' \in M^n$ be the dataset. For $m \in \mathbb{N}$ let

$I_m : M^m \to \{0, 1\}^M$ ($Z \in M^m$ is mapped on a mapping.)   (3)

be an “outlier identifier” with the interpretation that a point $z \in M$ is
an outlier with respect to $Z \in M^m$ if $I_m[Z](z) = 1$. The construction of
reasonable outlier identifiers will be discussed in later sections. For an
indicator vector $g = (g_1, \ldots, g_n)' \in \{0, 1\}^n$ let $Z_0(g)$ be the
matrix of the data points $z_i$ for which $g_i = 1$, and $n(g)$ be the number
of these points.

Definition. With outlier identifiers $I_1, \ldots, I_n$ define

$f_{Z_0} : \{0, 1\}^n \to \{0, 1\}^n, \quad g \mapsto \big(1 - I_{n(g)}[Z_0(g)](z_i)\big)_{i = 1, \ldots, n},$   (4)

i.e. $f_{Z_0}$ maps $g$ on the indicator vector of the $Z_0(g)$-non-outliers.
Then $Z_0(g)$ is an FPC if $g$ is a fixed point of $f_{Z_0}$. Thus, the
non-outliers w.r.t. $Z_0(g)$ have to be precisely the points of $Z_0(g)$.

This definition can be related to stochastic modeling as follows: Let $M$ be
a space with $\sigma$-algebra $\mathcal{B}$. Let $\mathcal{P}$ be the space of
distributions on $(M, \mathcal{B})$, and $\mathcal{P}_0 \subseteq \mathcal{P}$
the set of cluster reference distributions.

Definition. (Davies and Gather 1993) $A(P)$ is called $\alpha$-outlier region
w.r.t. $P$ under

$A : \mathcal{P} \to \mathcal{B}$ with $\forall P \in \mathcal{P}_0 : P(A(P)) \le \alpha.$   (5)

$0 < \alpha < 1$ should be small so that the points in $A(P)$ “lie really
outside”. Now, the set of points $A_m(Z) := \{z : I_m[Z](z) = 1\}$ is an
estimator of $A(P)$, if $Z$ is a dataset of i.i.d. points distributed by $P$.
If $Z = Z_0(g)$ forms an FPC w.r.t. $Z_0$, then $A_m(Z)^c$, i.e. the set of
$Z$-non-outliers, is an estimator (whose quality is to be assessed) for an
FPC-set of a distribution:

Definition. For a given distribution $P \in \mathcal{P}$ and
$B \in \mathcal{B}$ let $P_B := P(\cdot \mid B)$ be the distribution $P$
restricted to the points of $B$. Let $A$ be an outlier region. Define

$f_P : \mathcal{B} \to \mathcal{B}, \quad B \mapsto A(P_B)^c.$   (6)

$B$ is an FPC-set w.r.t. $P$ if $B$ is a fixed point of $f_P$.

With the help of this definition, the quality of the results of an FPCA with
respect to distributions $P$ of the form (2) can be investigated by comparing
the FPC-sets $B$ of $P$ (the distributions $P_B$, respectively) to $P_0$, e.g.
the high density areas of $P_0$.







The computation of all FPCs of a dataset requires the evaluation of (4)
for every subset of data. This is impossible, except for very small datasets.
Instead, one can apply the usual fixed point algorithm

$g^{i+1} = f(g^i)$   (7)

to various starting vectors $g^0$.
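
The iteration (7) can be sketched in a few lines. The outlier identifier is
passed in as a function, since the paper defines it separately for each data
type. The toy identifier below (distance from the subset mean larger than a
fixed radius) is a hypothetical stand-in chosen only to keep the example
short; it is not one of the identifiers proposed in the paper. The names
`fixed_point_cluster` and `radius_identifier` are also assumptions.

```python
def fixed_point_cluster(data, g0, is_outlier, max_iter=100):
    """Iterate g -> f(g) from the start vector g0 until a fixed point is found.

    data       : list of data points z_1, ..., z_n
    g0         : list of 0/1 indicators selecting the starting subset
    is_outlier : function (subset, z) -> bool, the outlier identifier
    Returns the indicator vector of the FPC, or None if the subset becomes
    empty or no fixed point is reached within max_iter steps.
    """
    g = list(g0)
    for _ in range(max_iter):
        subset = [z for z, gi in zip(data, g) if gi == 1]
        if not subset:
            return None
        # Non-outliers w.r.t. the current subset get indicator 1.
        g_next = [0 if is_outlier(subset, z) else 1 for z in data]
        if g_next == g:
            return g
        g = g_next
    return None


def radius_identifier(subset, z, radius=3.0):
    """Toy identifier (assumption): z is an outlier if it is far from the mean."""
    center = sum(subset) / len(subset)
    return abs(z - center) > radius


data = [1.0, 1.2, 0.8, 1.1, 10.0, 10.3, 9.9]
print(fixed_point_cluster(data, [1, 1, 1, 0, 0, 0, 0], radius_identifier))
print(fixed_point_cluster(data, [0, 0, 0, 0, 1, 0, 0], radius_identifier))
```

Different starting vectors lead to different fixed points here, which is the
reason the algorithm is applied to various starting vectors.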

Now, the application of this concept to various clustering problems will be 
demonstrated. 



3. Multidimensional Normal clusters 

The estimation of Normal mixture parameters is discussed broadly in the
literature, see e.g. Titterington et al. (1985). In this setup, a cluster of
points from $\mathbb{R}^p$ is considered as a group of points generated by a
common $\mathcal{N}_p(\mu, \Sigma)$-distribution. The definition of FPCs is
given by the definition of the cluster reference distributions and the
corresponding outlier identifier. Here:

$\mathcal{P}_0 := \{\mathcal{N}_p(\mu, \Sigma) : \mu \in \mathbb{R}^p,\ \Sigma \text{ pos. semidef. } p \times p\}.$   (8)

An easy $\alpha$-outlier region for the distributions of $\mathcal{P}_0$ has
the form

$A(\mathcal{N}_p(\mu, \Sigma)) := \{x \in \mathbb{R}^p : (x - \mu)' \Sigma^{-1} (x - \mu) > \chi^2_{p;1-\alpha}\}.$   (9)

Such outlier regions can be estimated by replacing $\mu$ and $\Sigma$ by
estimators $\hat{\mu}_m$ and $\hat{\Sigma}_m$. This leads to the following
outlier identifier: For $X \in (\mathbb{R}^p)^m$ and $x \in \mathbb{R}^p$:

$I_m[X](x) := 1\big((x - \hat{\mu}_m(X))' \hat{\Sigma}_m(X)^{-1} (x - \hat{\mu}_m(X)) > \chi^2_{p;1-\alpha}\big).$   (10)

Choosing $\hat{\mu}_m$ and $\hat{\Sigma}_m$ as mean vector and sample
covariance matrix leads to an outlier identification by means of the
Mahalanobis distance. This approach is computationally simple and is discussed
e.g. in Rousseeuw and Leroy (1988). They show that a better identification
procedure can be obtained if $\hat{\mu}_m$ and $\hat{\Sigma}_m$ are replaced
by more robust estimates.
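
A minimal sketch of the identifier (10) for the planar case p = 2, using the
mean vector and the sample covariance matrix. Restricting to p = 2 is an
assumption made to keep the example dependency-free: the 2 x 2 inverse can be
written out by hand, and the $\chi^2_2$ quantile has the closed form
$-2 \ln \alpha$. The function name is hypothetical, and the robust estimators
mentioned above are not implemented.

```python
import math


def mahalanobis_outlier_identifier(alpha):
    """Build the identifier I_m[X](x) of (10) for two-dimensional data."""
    # For 2 degrees of freedom the chi-square (1 - alpha)-quantile is -2 ln(alpha).
    cutoff = -2.0 * math.log(alpha)

    def is_outlier(subset, x):
        m = len(subset)
        mx = sum(p[0] for p in subset) / m
        my = sum(p[1] for p in subset) / m
        # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
        sxx = sum((p[0] - mx) ** 2 for p in subset) / (m - 1)
        syy = sum((p[1] - my) ** 2 for p in subset) / (m - 1)
        sxy = sum((p[0] - mx) * (p[1] - my) for p in subset) / (m - 1)
        det = sxx * syy - sxy ** 2
        if det <= 0:
            raise ValueError("sample covariance matrix is singular")
        dx, dy = x[0] - mx, x[1] - my
        # Squared Mahalanobis distance via the explicit 2 x 2 inverse.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        return d2 > cutoff

    return is_outlier


identifier = mahalanobis_outlier_identifier(alpha=0.01)
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
print(identifier(square, (1.0, 1.0)), identifier(square, (5.0, 5.0)))
```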



4. Clustering 0-1-vectors

Let $Z \in (\{0, 1\}^k)^n$ be a dataset of $k$-dimensional 0-1-vectors. Then
data points can be considered as belonging to a common cluster if they seem to
be generated by the same independent product of Bernoulli distributions. That
is,

$P[k, p_1, \ldots, p_k] := \bigotimes_{j=1}^{k} B[1, p_j], \quad \mathcal{P}_0 := \{P[k, p_1, \ldots, p_k] : (p_1, \ldots, p_k) \in [0, 1]^k\},$

where $B[k, p]$ denotes the Binomial distribution with parameters $k, p$. An
$\alpha$-outlier region for this kind of data can be defined by counting the
components for which a data point does not belong to the expected majority in
the cluster:

$\forall P[k, p_1, \ldots, p_k] \in \mathcal{P}_0 :$

$A(P[k, p_1, \ldots, p_k]) := \{z \in \{0, 1\}^k : v[p_1, \ldots, p_k](z) > c(\alpha)\},$   (11)

where $c(\alpha) := \min\{c : 1 - B[k, \tfrac{1}{2}](c) \le \alpha\}$,

$v[p_1, \ldots, p_k](z) := \sum_{j=1}^{k} \big[z_j 1(p_j < \tfrac{1}{2}) + (1 - z_j) 1(p_j \ge \tfrac{1}{2})\big].$

Here $v$ counts the “minority components” of $z$ and $c(\alpha)$ is chosen
such that (5) holds for all $P \in \mathcal{P}_0$, because
$B[k, \tfrac{1}{2}]$ is the distribution of
$v[\tfrac{1}{2}, \ldots, \tfrac{1}{2}](z)$. This is stochastically largest
among the $v[p_1, \ldots, p_k](z)$, where $\mathcal{L}(z) \in \mathcal{P}_0$.

Again, a proper outlier identifier can be defined by replacing
$p_1, \ldots, p_k$ by estimators in the definition of $v$, e.g.

$\hat{p}_{jm}(X) = \frac{1}{m} \sum_{i=1}^{m} x_{ij}$, where $X := (x_{ij}) \in (\{0, 1\}^k)^m.$   (12)
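
The pieces of (11) and (12) can be sketched as below. Reading
$B[k, \tfrac{1}{2}](c)$ as the Binomial distribution function evaluated at
$c$, and breaking ties at $p_j = \tfrac{1}{2}$ in favour of 1 as the majority
value, are assumptions of this sketch. All function names are hypothetical.

```python
from math import comb


def binom_cdf_half(k, c):
    """Distribution function of Binomial(k, 1/2) at the integer c."""
    return sum(comb(k, i) for i in range(c + 1)) / 2 ** k


def c_alpha(k, alpha):
    """Smallest c with 1 - B[k, 1/2](c) <= alpha, as in (11)."""
    for c in range(k + 1):
        if 1.0 - binom_cdf_half(k, c) <= alpha:
            return c
    return k  # not reached: the condition always holds at c = k


def minority_count(p, z):
    """v[p](z): number of components where z differs from the majority value."""
    return sum(zj if pj < 0.5 else 1 - zj for pj, zj in zip(p, z))


def estimate_p(X):
    """Column means of the 0-1 data matrix X, the estimator (12)."""
    m = len(X)
    return [sum(row[j] for row in X) / m for j in range(len(X[0]))]


def is_outlier(X, z, alpha):
    """Outlier identifier: plug the estimated p into the outlier region (11)."""
    return minority_count(estimate_p(X), z) > c_alpha(len(z), alpha)


X = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1]]
print(c_alpha(4, 0.1), is_outlier(X, [0, 0, 1, 1], 0.1), is_outlier(X, [1, 1, 0, 0], 0.1))
```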

5. Linear regression clusters

For analyzing the UNICEF data, the female literacy can be considered as
explanatory variable for the child mortality. While there is no clear structure
for the whole dataset, the relation for certain subsets of the data seems to
be approximately linear. Cluster reference distributions for linear regression
can be defined as follows:

$\mathcal{P}_0 := \{P_{\beta, \sigma^2, G} : \beta \in \mathbb{R}^{p+1},\ \sigma^2 \in \mathbb{R}^+,\ G \in \mathcal{G}\}$   (13)

where $P_{\beta, \sigma^2, G}$ is the distribution of
$(x, y) \in \mathbb{R}^{p+1} \times \mathbb{R}$ defined by

$y = x'\beta + u, \quad u \sim \mathcal{N}_{0, \sigma^2},$   (14)

$(x_1, \ldots, x_p)' \sim G$ independently of $u$, $x_{p+1} = 1$.   (15)

$\mathcal{G}$ is some suitable space of distributions on
$(\mathbb{R}^p, \mathbb{B}^p)$, and a simple $\alpha$-outlier region can be
constructed:

$A(P_{\beta, \sigma^2, G}) = \{(x, y) \in \mathbb{R}^p \times \{1\} \times \mathbb{R} : (y - x'\beta)^2 > c(\alpha)\sigma^2\},$   (16)

where $c(\alpha)$ is the $1 - \alpha$-quantile of the $\chi^2_1$-distribution.

If $\beta$ and $\sigma^2$ are replaced by estimators $\hat{\beta}_m$ and
$\hat{\sigma}^2_m$, the indicator function of $A$ is again a reasonable
outlier identifier. The best choices for $\hat{\beta}_m$ and
$\hat{\sigma}^2_m$ would be robust estimators, but they impose formal and
numerical problems. If one uses Least Squares and

$\hat{\sigma}^2_m(X) :=$   (17)

where $X := ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathbb{R}^p \times \{1\} \times \mathbb{R})^m$,

then the subset marked by full points in Figure 1 turns out to be an FPC with
respect to the UNICEF data. This subset contains all South American countries,
Oman and Czechoslovakia, the only European country in the sample.



6. Conclusion 

FPCA needs less restrictive assumptions than usual cluster analysis methods
based on stochastic models. It enables the analysis of more complex data
situations. The idea applies to various problems of cluster analysis.
Theoretical background is presented in Hennig (1997), especially for the
linear regression setup.

Abstract: In this paper we propose an approach for clustering large datasets of
mixed units based on representation of clusters by distributions of values of
variables over a cluster (histograms) that are compatible with merging of
clusters. The proposed representation can also be used for clustering symbolic
data. On the basis of this representation, adapted versions of the leaders
method and the adding method were implemented. The proposed approach was
successfully applied to several large datasets.

Keywords: large datasets, clustering, mixed units, distribution description
compatible with merging of clusters, leaders method, adding method.



1. Introduction 

In this paper we propose an approach for clustering large datasets of mixed (nonho- 
mogenous) units - units described by variables measured in different types of scales 
(numerical, ordinal, nominal). 

Let $E$ be a finite set of units. A nonempty subset $C \subseteq E$ is called
a cluster. A set of clusters $\mathbf{C} = \{C_i\}$ forms a clustering. In
this paper we shall require that every clustering $\mathbf{C}$ is a partition
of $E$.

The clustering problem can be formulated as an optimization problem:

Determine the clustering $\mathbf{C}^* \in \Phi$ for which

$P(\mathbf{C}^*) = \min_{\mathbf{C} \in \Phi} P(\mathbf{C})$

where $\Phi$ is a set of feasible clusterings and
$P : \Phi \to \mathbb{R}_0^+$ is a criterion function.

Most approaches to the clustering problem are based explicitly or implicitly on
some kind of criterion function that measures the deviation of units from a
selected description or representative of their cluster. Usually
$P(\mathbf{C})$ takes the form

$P(\mathbf{C}) = \sum_{C \in \mathbf{C}} \sum_{X \in C} d(X, R_C)$

where $R_C$ is a representative of cluster $C$ and $d$ a dissimilarity
(Brucker, 1978).

The cluster representatives usually consist of variable-wise summaries of vari- 
able values over the cluster, ranging from central values (mean, median, mode), 
(min, max) intervals (symbolic objects; Diday 1997), Tukey’s (1977) box-and- 
whiskers plots, to detailed distribution (histogram or fitted curve). 

In this paper we investigate a description satisfying two requirements:

• it should require a fixed space per variable;

• it should be compatible with merging of clusters: knowing the description of
two clusters we can, without additional information, produce the description
of their union.

Note that only some of the cluster descriptions are compatible with merging, for
example mean (as sum and number of units) for numerical variables and (min, max)
intervals for ordinal variables.



2. A cluster representative compatible with merging 

In our approach a cluster representative is composed from representatives of
each variable. They are formed, depending on the type of the scale in which a
variable is measured, in the following way. Let $\{V_i, i = 1, \ldots, k\}$ be
a partition of the range of values of variable $V$. Then we define for a
cluster $C$ the sets

$Q(i, C; V) = \{X \in C : V(X) \in V_i\}, \quad i = 1, \ldots, k$

where $V(X)$ denotes the value of variable $V$ on unit $X$.

In the case of an ordinal variable $V$ (numerical scales are a special case of
ordinal scales) the partition $\{V_i, i = 1, \ldots, k\}$ usually consists of
intervals determined by selected threshold values
$t_0 < t_1 < t_2 < t_3 < \cdots < t_{k-1} < t_k$, $t_0 = \inf V$,
$t_k = \sup V$.

For nominal variables we can obtain the partition, for example, by selecting
$k - 1$ values $t_1, t_2, t_3, \ldots, t_{k-1}$ from the range of variable $V$
(usually the most frequent values on $E$) and setting $V_i = \{t_i\}$,
$i = 1, \ldots, k - 1$; and putting all the remaining values in class $V_k$.
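
The recoding of raw values into class indices can be sketched as follows. The
intervals are taken as right-closed, $(t_{i-1}, t_i]$; this is an assumption
suggested by the interval notation used for the car clusters later in the
paper. The function names and the sample thresholds are hypothetical.

```python
from bisect import bisect_left


def ordinal_class(value, thresholds):
    """Class index 1..k of an ordinal value.

    thresholds holds the interior thresholds t_1 < ... < t_{k-1};
    class i corresponds to the interval (t_{i-1}, t_i].
    """
    return bisect_left(thresholds, value) + 1


def nominal_class(value, selected):
    """Class index 1..k of a nominal value.

    selected holds the k - 1 chosen values t_1, ..., t_{k-1};
    every other value falls into the remaining class k.
    """
    try:
        return selected.index(value) + 1
    except ValueError:
        return len(selected) + 1


heights = [1388, 1490]  # hypothetical thresholds, giving k = 3 classes
print([ordinal_class(h, heights) for h in (1300, 1388, 1389, 1490, 1600)])
print(nominal_class("diesel", ["petrol", "diesel"]), nominal_class("electric", ["petrol", "diesel"]))
```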

Using these sets we can introduce the concentrations

$q(i, C; V) = \operatorname{card} Q(i, C; V)$

and relative concentrations

$p(i, C; V) = \frac{q(i, C; V)}{\operatorname{card} C}$

It holds

$\sum_{i=1}^{k} p(i, C; V) = 1$

The description of variable $V$ over $C$ is the vector of concentrations of $V$.
Note that in the special case $C = \{X\}$ we have

$p(i, C; V) = 1$ if $X \in Q(i, C; V)$, and $0$ otherwise.

Similarly we can consider a missing value on $V$ for unit $X$ by setting
$p(i, \{X\}; V) = 1/k$, $i = 1, \ldots, k$. Using concentrations we can also
describe symbolic data.

It is easy to see that for two clusters $C_1$ and $C_2$,
$C_1 \cap C_2 = \emptyset$, we have

$Q(i, C_1 \cup C_2; V) = Q(i, C_1; V) \cup Q(i, C_2; V)$

$q(i, C_1 \cup C_2; V) = q(i, C_1; V) + q(i, C_2; V)$

The description is compatible with merging.
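
The merging property can be checked on a small example. In this sketch a
cluster is represented by the list of class indices of its units for one
variable; the function names and the sample data are hypothetical.

```python
def concentrations(classes, k):
    """q(i, C; V) for i = 1..k, given the class index of every unit in C."""
    q = [0] * k
    for c in classes:
        q[c - 1] += 1
    return q


def merge(q1, q2):
    """Description of the union of two disjoint clusters: counts simply add."""
    return [a + b for a, b in zip(q1, q2)]


def relative(q):
    """Relative concentrations p(i, C; V) = q(i, C; V) / card C."""
    n = sum(q)
    return [x / n for x in q]


c1 = [1, 1, 2]        # class indices of the units of a first cluster
c2 = [2, 3, 3, 3]     # class indices of the units of a second, disjoint cluster
merged = merge(concentrations(c1, 3), concentrations(c2, 3))
print(merged, merged == concentrations(c1 + c2, 3))
```

Only the k counts per variable are stored, so the space per variable stays
fixed however large the clusters become.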

The threshold values are usually determined in such a way that, for the given
set of units $E$ (or the space of units $\mathcal{E}$), it holds that
$p(i, E; V) \approx 1/k$, $i = 1, \ldots, k$.

As a compatible description of a nominal variable over a cluster $C$, its range
$V(C)$ can also be used, since we have $V(C_1 \cup C_2) = V(C_1) \cup V(C_2)$.



3. Dissimilarity 

Most clustering methods are based on some dissimilarity between clusters or
between a unit and a cluster. For our descriptions we define

$d(C_1, C_2; V) = \frac{1}{2} \sum_{i=1}^{k} |p(i, C_1; V) - p(i, C_2; V)|$

and in the case of a nominal variable described by a set of values

$d(C_1, C_2; V) = \frac{\operatorname{card}(V(C_1) \oplus V(C_2))}{\operatorname{card}(V(C_1) \cup V(C_2))},$

where $A \oplus B = A \cup B - A \cap B$.
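
Both dissimilarities are short to write down. In this sketch a histogram
description is a list of relative concentrations and a range description is a
Python set; the function names are hypothetical, and returning 0 for two empty
ranges is an assumption (clusters are nonempty, so the case should not arise).

```python
def d_hist(p1, p2):
    """Half the L1 distance between two vectors of relative concentrations."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))


def d_sets(v1, v2):
    """Size of the symmetric difference divided by the size of the union."""
    union = v1 | v2
    if not union:
        return 0.0  # assumption: two empty ranges are treated as identical
    return len(v1 ^ v2) / len(union)


print(d_hist([1.0, 0.0, 0.0], [0.5, 0.25, 0.25]))
print(d_sets({"a", "b"}, {"b", "c"}))
```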

We shall use the abbreviation $d(X, C; V) = d(\{X\}, C; V)$. In both cases it
can be shown that

• $d(C_1, C_2; V)$ is a semidistance on clusters for variable $V$; i.e.

   1. $d(C_1, C_2; V) \ge 0$

   2. $d(C, C; V) = 0$

   3. $d(C_1, C_2; V) + d(C_2, C_3; V) \ge d(C_1, C_3; V)$

• $d(C_1, C_2; V) \in [0, 1]$

and for the vector representation also

$X \in Q(i, E; V) \Rightarrow d(X, C; V) = 1 - p(i, C; V)$
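
The last identity follows in one line from the definitions above. For a unit
$X$ in class $i$, the description of $\{X\}$ is 1 in component $i$ and 0
elsewhere, and the relative concentrations of $C$ sum to 1, so:

```latex
\begin{aligned}
d(X, C; V)
  &= \tfrac{1}{2}\Big[\,\bigl|1 - p(i, C; V)\bigr|
     + \sum_{j \ne i} \bigl|0 - p(j, C; V)\bigr|\Big] \\
  &= \tfrac{1}{2}\Big[\,\bigl(1 - p(i, C; V)\bigr)
     + \bigl(1 - p(i, C; V)\bigr)\Big] \\
  &= 1 - p(i, C; V).
\end{aligned}
```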

The semidistances on clusters for individual variables can be combined into a
semidistance on clusters by

$d(C_1, C_2) = \sum_{j=1}^{m} \alpha_j\, d(C_1, C_2; V_j), \quad \sum_{j=1}^{m} \alpha_j = 1, \quad \alpha_j \ge 0$

where $m$ is the number of variables and $\alpha_j$ are weights (Batagelj and
Bren, 1995); often



4. Clustering procedures

The proposed approach allows us to first recode the original nonhomogenous data 
to a uniform representation by integers - indices of intervals. For the recoded data 
efficient clustering procedures can be built by adapting leaders method (Hartigan, 
1975) or adding clustering method (Batagelj and Mandelj, 1993). In this paper, 
because of limited space, we shall describe only the procedure based on the dynamic 
clusters method (a generalization of the leader method). 

To describe the dynamic clusters method for solving the clustering problem let
us denote (Diday, 1979; Batagelj, 1985): $\Lambda$ a set of representatives;
$L \subseteq \Lambda$ a representation; $\Psi$ a set of feasible
representations; $W : \Phi \times \Psi \to \mathbb{R}_0^+$ an extended
criterion function; $G : \Phi \times \Psi \to \Psi$ a representation function;
$F : \Phi \times \Psi \to \Phi$ a clustering function; and suppose that the
following conditions are satisfied:

W0. $P(\mathbf{C}) = \min_{L \in \Psi} W(\mathbf{C}, L)$

the functions $G$ and $F$ tend to improve (diminish) the value of the extended
criterion function $W$:

W1. $W(\mathbf{C}, G(\mathbf{C}, L)) \le W(\mathbf{C}, L)$

W2. $W(F(\mathbf{C}, L), L) \le W(\mathbf{C}, L)$

then the dynamic clusters method can be described by the scheme:

    L := L0; C := C0;
    repeat
        L := G(C, L);
        C := F(C, L);
    until the goal is attained

To this scheme corresponds the sequence $(\mathbf{C}_n, L_n)$,
$n \in \mathbb{N}$, determined by the relations

$L_{n+1} = G(\mathbf{C}_n, L_n)$ and $\mathbf{C}_{n+1} = F(\mathbf{C}_n, L_{n+1})$

and the sequence of values of the extended criterion function
$u_n = W(\mathbf{C}_n, L_n)$.

Let us assume the following model: $\mathbf{C} = \{C_i\}_{i \in I}$,
$L = \{L_i\}_{i \in I}$, $L(X) = L_i : X \in C_i$, and further
$L = [L(V_1), \ldots, L(V_m)]$,
$L(V) = [s(1, L; V), \ldots, s(k, L; V)]$, with

$d(C, L; V) = \frac{1}{2} \sum_{j=1}^{k} |p(j, C; V) - s(j, L; V)|$

For

$W(\mathbf{C}, L) = \sum_{X \in E} d(X, L(X)) = \sum_{i \in I} p(C_i, L_i)$

where

$p(C, L) = \sum_{X \in C} d(X, L)$

we define $F(L) = \{C_i'\}$ by

$X \in C_i' : i = \min \operatorname{Argmin}_j \{d(X, L_j) : L_j \in L\}$

(each unit is assigned to the nearest leader); and we define
$G(\mathbf{C}) = \{L_i'\}$ by

$L_i' = \operatorname{argmin}_{L}\, p(C_i, L)$

To solve the last optimization problem we consider

$p(C, L; V) = \sum_{X \in C} d(X, L; V) = \operatorname{card}(C) - \sum_{j=1}^{k} q(j, C; V)\, s(j, L; V)$

This expression has a minimum value iff the sum has a maximum value. The unique
symmetric optimal solution is

$s(j, L'; V) = 1/t$ if $j \in M$, and $0$ otherwise,

where $M = \{j : q(j, C; V) = \max_i q(i, C; V)\}$ and $t = \operatorname{card} M$.

Evidently W0, W1 and W2 hold. It also holds that
$P(\mathbf{C}) = W(\mathbf{C}, G(\mathbf{C}))$.

Usually $t = 1$: we include in the representative of a cluster the most
frequent range of values of the variable on this cluster. This provides us with
very simple interpretations of clustering results.
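
A minimal sketch of the resulting leaders-type iteration on recoded data,
under several simplifying assumptions that are not taken from the paper: every
unit is a tuple of class indices (one per variable), all variables have equal
weight, a leader stores only the modal class of each variable (the case
$t = 1$), and ties between equally frequent classes are broken by the smallest
class index instead of the symmetric $1/t$ solution. The iteration starts from
initial leaders only. All names are hypothetical.

```python
from collections import Counter


def unit_leader_distance(unit, leader):
    """Share of variables on which the unit's class differs from the leader's."""
    return sum(1 for u, l in zip(unit, leader) if u != l) / len(unit)


def assign(units, leaders):
    """Clustering function F: each unit goes to its nearest leader."""
    clusters = [[] for _ in leaders]
    for x in units:
        dists = [unit_leader_distance(x, leader) for leader in leaders]
        clusters[dists.index(min(dists))].append(x)  # smallest index on ties
    return clusters


def representatives(clusters, leaders):
    """Representation function G: modal class of every variable per cluster."""
    new_leaders = []
    for cluster, old in zip(clusters, leaders):
        if not cluster:
            new_leaders.append(old)  # keep the old leader of an empty cluster
            continue
        leader = []
        for v in range(len(old)):
            counts = Counter(x[v] for x in cluster)
            top = max(counts.values())
            leader.append(min(c for c, n in counts.items() if n == top))
        new_leaders.append(tuple(leader))
    return new_leaders


def dynamic_clusters(units, leaders, max_iter=100):
    """Alternate F and G until the leaders no longer change."""
    leaders = list(leaders)
    for _ in range(max_iter):
        new_leaders = representatives(assign(units, leaders), leaders)
        if new_leaders == leaders:
            break
        leaders = new_leaders
    return assign(units, leaders), leaders


units = [(1, 1), (1, 1), (1, 2), (3, 3), (3, 3), (2, 3)]
clusters, leaders = dynamic_clusters(units, [(1, 2), (2, 3)])
print(leaders, [len(c) for c in clusters])
```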

For example, in a clustering of types of cars we obtained among 14 clusters: 
(Remark: SIT is an abbreviation for a Slovenian money called tolar.) 



Cluster 5: (42 units) coupes

doors = 2, height(mm) ∈ ( - , 1388 ], power(KW) ∈ ( 125, - ), acceleration
time(s) ∈ ( - , 9.1 ], max speed(km/h) ∈ ( 215, - ), price(1000 SIT) ∈
( 6550, - ).

Cluster 8: (57 units) minivans

doors = 5, passengers = 7, length(mm) ∈ ( 4555, 4761 ], height(mm) ∈ ( 1490,
- ), fuel tank capacity(l) ∈ ( 66, 76 ], weight(kg) ∈ ( 1540, - ),
wheelbase(mm) ∈ ( 2730, - ).

Cluster 9: (193 units) small-cars

doors = 4, passengers = 5, length(mm) ∈ ( - , 4010 ], width(mm) ∈ ( - , 1680 ],
luggage capacity(l) ∈ ( - , 279 ], fuel tank capacity(l) ∈ ( - , 48 ],
weight(kg) ∈ ( - , 940 ], power(KW) ∈ ( - , 55 ], torque ∈ ( - , 124 ],
brakes = discs/drums, max speed(km/h) ∈ ( - , 163 ], price(1000 SIT) ∈
( - , 2100 ].

Cluster 13: (62 units) sport-utility

doors = 5, height(mm) ∈ ( 1490, - ), fuel tank capacity(l) ∈ ( 76, - ),
weight(kg) ∈ ( 1540, - ), cargo capacity(l) ∈ ( 560, - ), torque ∈ ( 270, - ).

For another criterion function

$W(\mathbf{C}, L) = \alpha \sum_{i \in I} d(C_i, L_i) + \sum_{i \in I} p(C_i, L_i)$

where $\alpha$ is a large constant, the first term can be set to 0 by setting
$s(j, L_i; V) = p(j, C_i; V)$, $j = 1, \ldots, k$. If this defines the
function $G$, and $F$ is defined as in the previous case, we obtain a
procedure which works very well on real data. Note also that for the obtained
(local) minimum $(\mathbf{C}^*, L^*)$ it holds

$P(\mathbf{C}^*) = \min_{L \in \Psi} W(\mathbf{C}^*, L) = W(\mathbf{C}^*, G(\mathbf{C}^*)) = \sum_{C \in \mathbf{C}^*} p(C, G(C))$

5. Conclusion 

We successfully applied the proposed approach to the dataset of types of cars
(1349 units) and also to some large datasets from the AI collection

http://www.ics.uci.edu/~mlearn/MLRepository.html

An extended version of this paper is available at

http://vlado.fmf.uni-lj.si/pub/cluster/

nology of Slovenia, Project J1-8532.



References 

Batagelj, V. (1985). Notes on the dynamic clusters method, in: IV conference on
applied mathematics, Split, May 28-30, 1984. University of Split, Split,
139-146.

Batagelj, V. & Bren, M. (1995). Comparing Resemblance Measures, Journal of
Classification, 12, 1, 73-90.

Batagelj, V. & Mandelj, M. (1993). Adding Clustering Algorithm Based on L-W-J
Formula, Paper presented at: IFCS 93, Paris, 31 Aug-4 Sep 1993.

Brucker, P. (1978). On the complexity of clustering problems. Lecture Notes in
Economics and Mathematical Systems 175, in: Optimization and Operations
Research, Proceedings, Bonn. Henn, R., Korte, B., Oettli, W. (Eds.),
Springer-Verlag, Berlin 1978.

Diday, E. (1979). Optimisation en classification automatique, Tome 1, 2. INRIA,
Rocquencourt (in French).

Diday, E. (1997). Extracting Information from Extensive Data sets by Symbolic
Data Analysis, in: Indo-French Workshop on Symbolic Data Analysis and its
Applications, Paris, 23-24 September 1997, Paris IX, Dauphine, 3-12.

Hartigan, J.A. (1975). Clustering Algorithms, Wiley, New York.

Tukey, J.W. (1977). Exploratory Data Analysis, Addison-Wesley, Reading, MA.




A Critical Approach to Non-Parametric 
Classification of Compositional Data 

J. A. Martín-Fernández¹, C. Barceló-Vidal¹, V. Pawlowsky-Glahn²

¹ Dept. d'Informàtica i Matemàtica Aplicada, Escola Politècnica Superior,
Universitat de Girona, Lluís Santaló, s/n, E-17071 Girona, Spain
² Dept. de Matemàtica Aplicada III, ETS de Eng. de Camins, Canals i Ports,
Universitat Politècnica de Catalunya, Jordi Girona Salgado, 1 i 3,
E-08034 Barcelona, Spain

Abstract: The application of hierarchic methods of classification needs to es- 
tablish in advance some or all of the following measures: difference, central 
tendency and dispersion, in accordance with the nature of the data. In this work, 
we present the requirements for these measures when the data set to classify is 
a compositional data set. Specific measures of difference, central tendency and 
dispersion are defined to be used with the most usual non-parametric methods 
of classification. 

Key words: Compositional Data, Cluster Analysis, Classification.



1. Introduction 

Any vector $\mathbf{x} = (x_1, \ldots, x_D)$ with non-negative elements representing
proportions of some whole is subject to the unit-sum constraint $x_1 + \cdots + x_D = 1$.
Compositional data, consisting of such vectors of proportions (compositions), 
play an important role in many disciplines. Frequently, some form of statistical 
analysis is essential for the adequate analysis and interpretation of the data. 
Nevertheless, all too often the unit-sum constraint is either ignored or improperly
incorporated into the statistical modelling, giving rise to an erroneous or
irrelevant analysis. The purpose of this paper is to review the specific statistical
requirements of standard hierarchic agglomerative classification methods when
they are performed on compositional data.

In the next section we present the ideas proposed by Aitchison (1992) 
about the conditions that have to be satisfied by any distance between two 
compositions, and by any measure of central tendency and dispersion of a com- 
positional data set. Next, we propose a modification of the most standard hier- 
archic agglomerative classification methods to make them suitable for the clas- 
sification of a compositional data set. Finally, we present two examples where 
the proposed methodology is applied to simulated compositional data sets. 
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2. Statistical analysis of compositional data 

If a vector $\mathbf{w} = (w_1, \ldots, w_D) \in \mathbb{R}_+^D$ with non-negative
components is compositional, we are implicitly recognising that the total size
$w_1 + \cdots + w_D$ of the composition is irrelevant. Therefore, a suitable sample
space for compositional data is the unit simplex

$$S^{D-1} = \{ (x_1, \ldots, x_D) : x_j > 0 \ (j = 1, \ldots, D), \ x_1 + \cdots + x_D = 1 \},$$

and any meaningful function $f$ of a composition must be invariant under the
group of scale transformations; i.e. $f(\lambda \mathbf{w}) = f(\mathbf{w})$, for every
$\lambda > 0$. Note that only functions expressed in terms of ratios of the components
of the composition satisfy this condition (Aitchison, 1992).

As an analogy to the role played by the group of translations when the
sample space is the real space $\mathbb{R}^D$, Aitchison (1986, Section 2.8) introduces
the group of perturbations as a means to characterize the 'difference' between two
compositions. If we denote the perturbation operation by $\circ$, then the
perturbation $\mathbf{p} = (p_1, \ldots, p_D) \in S^{D-1}$ applied to a composition
$\mathbf{x}$ produces the new composition

$$\mathbf{p} \circ \mathbf{x} = \frac{(p_1 x_1, \ldots, p_D x_D)}{\sum_{j=1}^{D} p_j x_j}.$$

If $\mathbf{x}, \mathbf{x}^* \in S^{D-1}$ are two compositions it is easy to prove that
the perturbation

$$\mathbf{x}^* \circ \mathbf{x}^{-1} = \frac{(x_1^*/x_1, \ldots, x_D^*/x_D)}{\sum_{j=1}^{D} x_j^*/x_j}$$

moves $\mathbf{x}$ to $\mathbf{x}^*$.

2.a Distance between two compositions

The requirements which any scalar measure of distance between two composi- 
tions should verify, according to the definitions given by Aitchison (1992), are 
scale invariance, permutation invariance, subcompositional dominance and 
perturbation invariance. These requirements are sensible, as they acknowledge 
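the compositional nature of the data. A feasible distance between two
compositions $\mathbf{x}, \mathbf{x}^*$ is given by

$$\Delta(\mathbf{x}, \mathbf{x}^*) = \left[ \frac{1}{D} \sum_{i<j} \left( \log\frac{x_i}{x_j} - \log\frac{x_i^*}{x_j^*} \right)^2 \right]^{1/2} \qquad (2)$$
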
which is equivalent to the distance proposed by Aitchison (1992). Using the
definition of centred logratio transformation (clr) from $S^{D-1}$ to $\mathbb{R}^D$
given by

$$\mathrm{clr}(\mathbf{x}) = \left( \log\frac{x_1}{g(\mathbf{x})}, \ldots, \log\frac{x_D}{g(\mathbf{x})} \right) \qquad (3)$$

where $g(\mathbf{x})$ is the geometric mean of the composition $\mathbf{x}$, it is easy
to establish that

$$\Delta(\mathbf{x}, \mathbf{x}^*) = d_{eu}(\mathrm{clr}(\mathbf{x}), \mathrm{clr}(\mathbf{x}^*)), \qquad (4)$$

where $d_{eu}$ represents the Euclidean distance. Moreover, since
$\mathrm{clr}(\mathbf{p} \circ \mathbf{x}) = \mathrm{clr}(\mathbf{p}) + \mathrm{clr}(\mathbf{x})$,
for any $\mathbf{p}, \mathbf{x} \in S^{D-1}$, it is clear that the distance (4) is
perturbation invariant.
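
The relation (4) and the invariance properties can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the three example compositions are illustrative assumptions. It also evaluates a pairwise-logratio form divided by D, a standard identity that gives the same value as the clr form.

```python
import math

def clr(x):
    """Centred logratio transform: log of each part over the geometric mean."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean g(x)
    return [l - mean_log for l in logs]

def aitchison_distance(x, y):
    """Euclidean distance between clr(x) and clr(y), as in (4)."""
    return math.dist(clr(x), clr(y))

def pairwise_logratio_distance(x, y):
    """Same distance written with pairwise logratios (scaled by 1/D)."""
    D = len(x)
    s = sum((math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
            for i in range(D) for j in range(i + 1, D))
    return math.sqrt(s / D)

def perturb(p, x):
    """Perturbation p o x: component-wise product rescaled to unit sum."""
    prod = [a * b for a, b in zip(p, x)]
    total = sum(prod)
    return [v / total for v in prod]

# Illustrative compositions (assumed values, not from the paper).
x = [0.6, 0.3, 0.1]
y = [0.2, 0.5, 0.3]
p = [0.1, 0.7, 0.2]

d_before = aitchison_distance(x, y)
d_after = aitchison_distance(perturb(p, x), perturb(p, y))  # perturbation invariance
d_scaled = aitchison_distance([10 * v for v in x], y)        # scale invariance
d_pairwise = pairwise_logratio_distance(x, y)
```

All four values agree up to rounding, whereas the plain Euclidean distance between x and y would change under either transformation.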



2.b Measure of central tendency of a compositional data set 

If $X = \{ \mathbf{x}_i = (x_{i1}, \ldots, x_{iD}) \in S^{D-1} : i = 1, \ldots, N \}$
represents a set of compositions, the arithmetic mean $\bar{X}$ of the data set is
usually not representative of the 'centre' of the set, and neither is it compatible
with the group of perturbations. Aitchison (1997) proposes the geometric mean
cen(X) as more representative of the central tendency of a compositional data set.
It is defined as

$$\mathrm{cen}(X) = \frac{(g_1, \ldots, g_D)}{g_1 + \cdots + g_D} \qquad (5)$$

where $g_j = \left( \prod_{i=1}^{N} x_{ij} \right)^{1/N}$ is the geometric mean of the
$j$th component of the data set. Figure 1 shows the ternary diagram of a simulated
data set (adapted from Aitchison, 1986, data set 1) where it can be observed that
the geometric mean lies inside the bulk of data, while the arithmetic mean lies
slightly apart.

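Figure 1: Data set in the simplex (geometric mean = (0.6058, 0.2719, 0.1223)
and arithmetic mean = (0.54, 0.2756, 0.1844) )

It is easy to prove that $\mathrm{cen}(\mathbf{p} \circ X) = \mathbf{p} \circ \mathrm{cen}(X)$
for any perturbation $\mathbf{p} \in S^{D-1}$, and that
$\mathrm{clr}(\mathrm{cen}(X)) = \overline{\mathrm{clr}(X)}$, the arithmetic mean of the
clr-transformed data. Therefore, it will be true that

$$\Delta(\mathbf{x}, \mathrm{cen}(X)) = d_{eu}(\mathrm{clr}(\mathbf{x}), \overline{\mathrm{clr}(X)}). \qquad (6)$$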
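
As a small illustration of the centre (5), the sketch below computes the closed vector of component-wise geometric means and checks that its clr transform equals the mean of the clr-transformed data. The function names and the four compositions are assumptions made for the example only.

```python
import math

def clr(x):
    """Centred logratio transform of one composition."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def centre(X):
    """cen(X): component-wise geometric means, rescaled to unit sum."""
    n, D = len(X), len(X[0])
    g = [math.exp(sum(math.log(row[j]) for row in X) / n) for j in range(D)]
    total = sum(g)
    return [v / total for v in g]

# Illustrative data set (assumed values).
X = [[0.6, 0.3, 0.1],
     [0.5, 0.3, 0.2],
     [0.7, 0.2, 0.1],
     [0.2, 0.3, 0.5]]

cen = centre(X)
mean_clr = [sum(clr(row)[j] for row in X) / len(X) for j in range(3)]
arithmetic_mean = [sum(row[j] for row in X) / len(X) for j in range(3)]
```

Comparing `cen` with `arithmetic_mean` shows how the two notions of centre can differ on the same data.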

2.c Measure of dispersion of a compositional data set

It is sensible to assume that any measure of dispersion of a compositional data
set should be invariant under the group of perturbations. The measure of
dispersion defined by Aitchison (1992, 1997) satisfies this condition; it is based
on the trace of the covariance matrix of the centred logratio transformed
compositions. In accordance with his definition, a measure of total variability of
a compositional data set X can be defined as

$$\mathrm{totvar}(X) = \sum_{i=1}^{N} \sum_{j=1}^{D} \left( \log\frac{x_{ij}}{g(\mathbf{x}_i)} - m_j \right)^2 \qquad (7)$$

where $m_j = \frac{1}{N} \sum_{i=1}^{N} \log\frac{x_{ij}}{g(\mathbf{x}_i)}$ $(j = 1, \ldots, D)$.
It is easy to prove that

$$\mathrm{totvar}(X) = \sum_{i=1}^{N} d_{eu}^2\left( \mathrm{clr}(\mathbf{x}_i), \overline{\mathrm{clr}(X)} \right) = \sum_{i=1}^{N} \Delta^2(\mathbf{x}_i, \mathrm{cen}(X)), \qquad (8)$$

which proves that the proposed measure of total variability (7) is compatible
with the distance defined in (2). It can also be proved that
$\mathrm{totvar}(\mathbf{p} \circ X) = \mathrm{totvar}(X)$, for any perturbation
$\mathbf{p} \in S^{D-1}$, which proves that the measure of total variability (7) is
invariant under perturbations.
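
The invariance of the total variability under perturbation can be verified on a toy data set. This is a sketch under assumed inputs (the data and the perturbation vector are invented for illustration); `totvar` is computed as the sum of squared Euclidean distances in clr space to the mean clr vector.

```python
import math

def clr(x):
    """Centred logratio transform of one composition."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def perturb(p, x):
    """Perturbation p o x: component-wise product rescaled to unit sum."""
    prod = [a * b for a, b in zip(p, x)]
    total = sum(prod)
    return [v / total for v in prod]

def totvar(X):
    """Sum of squared distances, in clr space, to the mean clr vector."""
    Z = [clr(row) for row in X]
    D = len(Z[0])
    mean = [sum(z[j] for z in Z) / len(Z) for j in range(D)]
    return sum((z[j] - mean[j]) ** 2 for z in Z for j in range(D))

# Illustrative data and perturbation (assumed values).
X = [[0.6, 0.3, 0.1],
     [0.5, 0.3, 0.2],
     [0.7, 0.2, 0.1],
     [0.2, 0.3, 0.5]]
p = [0.1, 0.7, 0.2]

tv = totvar(X)
tv_perturbed = totvar([perturb(p, row) for row in X])
```

The two values coincide up to rounding because a perturbation only shifts every clr vector by the same constant vector clr(p).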



3. Hierarchic cluster analysis of compositional data 

Before applying any hierarchic method of classification to a data set, it is
necessary to establish in advance some or all of the following measures: of
difference, central tendency and dispersion, to be used in accordance with the
nature of the data. Therefore, if we are using a hierarchic method to classify a
compositional data set, we have to take into account that all these measures must
be invariant under the group of scale transformations. It is clear that the
definitions given in (2), (5) and (8) are scale-invariant, while the Euclidean
distance is not. Therefore, from this point of view, it is wrong to use the
Euclidean distance between two compositions to calculate the matrix of distances
associated with hierarchic methods like single linkage, complete linkage and
average linkage. We propose to use the distance defined in (2). By property (4),
this distance is equivalent to the Euclidean distance between the compositions
transformed by the centred logratio transformation clr defined in (3).

Likewise, any method of classification which reduces the distance from
a composition to a set of compositions to the distance between the composition
and the 'centre' of the group, has to take into account the considerations made
in the previous section 2.b. We propose to use (5) as a definition of the 'centre'
of a set of compositions, in addition to the distance defined in (2). Then, by
property (6), it is easy to conclude that the centroid classification method can
be used if it is applied to the data transformed by the centred logratio
transformation.

On the other hand, the well-known method of Ward is a hierarchic
method which uses the measure of dispersion to classify the data. In essence,
this method is based on the concept of variability on a cluster $C$. This
variability is defined by the sum $\sum_{\mathbf{x} \in C} d_{eu}^2(\mathbf{x}, \bar{C})$
(Everitt, 1993, Section 5.2), where $\bar{C}$ denotes the centre of the class. When
the data set is compositional, we suggest replacing the squared Euclidean distance
$d_{eu}^2(\mathbf{x}, \bar{C})$ which appears in the previous sum, by
$\Delta^2(\mathbf{x}, \mathrm{cen}(C))$ defined in (2). Then, by definition (7) and
property (8), the variability of a cluster $C$ of compositions will be equal to
totvar(C). Thus, modifying the method of Ward to make it suitable for the
classification of a compositional data set X is equivalent to applying the
standard procedure to clr(X).
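
The practical recipe of this section, transform with clr and then run a standard hierarchic method, can be sketched as follows. The naive single-linkage routine, the six compositions and the perturbation vector are illustrative assumptions, not the authors' implementation or data; the second group is generated from the first by a perturbation, as in the examples of the next section.

```python
import math

def clr(x):
    """Centred logratio transform of one composition."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def perturb(p, x):
    """Perturbation p o x: component-wise product rescaled to unit sum."""
    prod = [a * b for a, b in zip(p, x)]
    total = sum(prod)
    return [v / total for v in prod]

def single_linkage(points, k):
    """Naive agglomerative single linkage; stops when k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Group 1 (assumed values) and group 2 obtained from it by a perturbation.
group1 = [[0.70, 0.20, 0.10], [0.68, 0.21, 0.11], [0.72, 0.19, 0.09]]
p = [0.05, 0.15, 0.80]
group2 = [perturb(p, x) for x in group1]

data = group1 + group2
labels = single_linkage([clr(x) for x in data], k=2)
```

Here `labels` separates observations 0 to 2 from observations 3 to 5. The routine is cubic in the number of observations and is meant only to make the equivalence concrete; any standard clustering library applied to the clr-transformed rows would play the same role.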



4. Two examples 

Consider the 50 points plotted on a ternary diagram in Figure 2a, corresponding
to a simulated compositional data set X1 characterised by three components.
Samples belong to two groups obtained one from the other by the application of
a perturbation. Figure 3a shows the ternary diagram of a second simulated data
set X2 (adapted from Aitchison, 1986, data set 1) with 50 elements, which has
been generated and labelled in a similar manner. Figures 2b and 3b show the plots
in $\mathbb{R}^3$ of the clr-transformed data sets clr(X1) and clr(X2), respectively.
In each case, the original groups are separated by a line and the groups resulting
from a clustering method are distinguished by different symbols.

As can be observed, the original groups show no overlap either in $S^2$
or in $\mathbb{R}^3$ but, while Figures 3a and 3b also show a clear visual separation
of the two groups, this is not the case for the data represented in Figures 2a and
2b. For the sake of comparison, different standard classification methods have been
applied to the four sets using the Euclidean distance. Misclassification rates are
listed in Table 1.

Figure 2: Example 1 (a) Plot in the simplex (groups from Ward's method) (b)
clr-transformed set (groups obtained using single linkage); classification
results distinguished by symbols '+' and 'o'.

Figure 3: Example 2 (a) Plot in the simplex (groups obtained using single
linkage) (b) clr-transformed set (groups from Ward's method); classification
results distinguished by symbols '+' and 'o'.






Table 1: Misclassification rates

                     Example 1            Example 2
Method               X1      clr(X1)      X2      clr(X2)
Single Linkage       48%     12%          0%      8%
Ward                 50%     6%           8%      50%
Complete Linkage     22%     6%           42%     50%
Centroid             48%     6%           42%     50%
Average Linkage      50%     6%           28%     50%

As could be expected, a poor classification power is obtained for X1,
because the two groups are close from a Euclidean point of view (only the
complete linkage method gives an acceptable classification). However, the
classification power seems to be reasonable for clr(X1), because only the three
elements close to the border are misclassified. For clr(X1) the poorest result is
obtained when single linkage is used. Figure 4 shows the associated dendrogram.
The samples of the first group are labelled from 1 to 25, while the others are
labelled from 26 to 50. This dendrogram is similar to the dendrograms associated
with the other methods, with only one difference: the compositions labelled 34, 47
and 48 are considered by the single linkage method as separate groups (see
Figure 2(b)).

Figure 4: Dendrogram of the single linkage method applied to clr(X1).




Results are more striking for X2 and clr(X2), given the clear
separation between groups both in the simplex and in real space: only single
linkage and Ward's methods show a high classification power for X2, while
complete linkage, centroid and average linkage methods have a poor classification
power. They work still worse when applied to clr(X2) because the two
groups are parallel and elongated. Figure 5(b) shows the associated dendrogram
when the single linkage method is applied to the data set clr(X2). Numbers
from 1 to 25 correspond to the first group and 26 to 50 to the second. It can be
observed that almost all samples are well classified: only the observations
labelled 10 and 17, and observations 35 and 42, are misclassified as separate
groups (see Figure 5(a)).

Figure 5: Single linkage method applied to clr(X2): (a) Plot (b) Dendrogram
(axes: Observations, Distance)

5. Conclusions 

• There are theoretical objections to the application of the standard hierarchic 
classification methods to compositional data sets because they don't take into 
account the nature of this kind of data. 

• To classify a compositional data set, we suggest adapting the usual hierar- 
chic methods using the definitions of distance, centre and variability defined 
in (2), (5), and (7), which are compatible with the compositional nature of 
the data. This is equivalent to applying standard methods to the centred 
logratio transformed data set. 

• Further research is needed to understand more thoroughly the performance 
of modified standard parametric classification methods when they are ap- 
plied to compositional data. 
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Abstract: We describe an effective approach to object or feature detection 
in point patterns via noise modeling. This is based on use of a redundant 
or non-pyramidal wavelet transform. Noise modeling is based on a Poisson 
process. We illustrate this new method with a range of examples. We use the 
close relationship between image (pixelated) and point representations to 
achieve the result of a clustering method with constant-time computational 
cost. We then proceed to generalize this method for high-dimensional data.
Using a dataset of very well-known structure as a test case, we show proof of 
concept for this approach to analysis of high-dimensional boolean hyperlink 
datasets. 

Keywords: Cluster Analysis, Wavelet Transform, Multiresolution Analy- 
sis. 



1 Point Pattern Clustering 

Point pattern clustering (Murtagh and Starck, 1998) has constituted one of the
major strands in cluster analysis. We will describe (i) a multiscale approach
which is computationally very efficient, and (ii) a direct treatment of noise
and clutter which leads to improved cluster detection.

The à trous wavelet transform (Shensa, 1992; Murtagh, 1998; Starck et
al., 1998) is redundant and non-orthogonal. It is efficient, being of linear
computational cost in terms of the size of the input data. Treating a point
pattern set as a 2-dimensional image allows point occurrences to be accumulated,
and allows an image of fixed dimensionality to represent the user
field of view. Given such a two-dimensional image of fixed dimensionality,
there is no computational dependence on the actual number of points. This
analysis method is therefore of O(1) computational cost in terms of the
point patterns examined.

Although other analysis frameworks could be easily envisaged, we have
applied this approach in the following way. We have visualized the data,
examining each resolution level. We have retained the particular level
with greatest interpretational value. Contiguous regions of a robustly
thresholded version of such a resolution level are examined. Cluster
properties are measured.

Data are usually noisy. The wavelet coefficients will therefore be noisy 
too. We have determined significance levels for detection of signal at each 
resolution level, on the assumption of low-count Poisson noise. This allows 
noise filtering and reconstruction of the data. 

We will show examples of this approach, including the finding of scarcely 
perceptible point pattern clusters in a noisy background (e.g. 550 ‘signal’ 
patterns with 40,000 ‘noise’ points in the background). High-quality recov- 
ery of known clusters is shown. 



2 Point Patterns and Wavelet Transforms 

Given a planar point pattern, a 2-dimensional image is created by: 

1. Considering a point at (x, y) as defining the value one at that point,
yielding the tuple (x, y, 1).

2. Projection onto a plane by (i) using a regular discrete grid (an image)
and (ii) assigning the contribution of points to the image pixels by
means of the interpolation function, $\phi$, used by the chosen wavelet
transform algorithm (in our case, the à trous algorithm with a $B_3$
spline).

3. The à trous algorithm is applied to the resulting image. Based on
a noise model for the original image (i.e. tuples (x, y, 1)), significant
structures are detected at each resolution level.

The wavelet transform used is the à trous (“with holes”) method. It is a
redundant (i.e. non-pyramidal) method and has computational cost which
is linear as a function of the number of pixels in the input data.

In a wavelet transform, a series of transformations of an image is generated,
providing a resolution-related set of “views” of the image. The
properties satisfied by a wavelet transform, and in particular the à trous
wavelet transform (so called because of the interlaced convolution used in
successive levels: see step 2 of the algorithm below) are further discussed in
Bijaoui et al. (1994).

A summary of the à trous wavelet transform is as follows. Index $k$ ranges
over all pixels.

1. Initialize $i$ to 0, starting with an image $c_i(k)$.

2. Increment $i$, and carry out a discrete convolution of the data $c_{i-1}(k)$
using a filter $h$ (see below). The distance between a central pixel and
adjacent ones is $2^{i-1}$.

3. From this smoothing we obtain the discrete wavelet transform,
$w_i(k) = c_{i-1}(k) - c_i(k)$.

4. If $i$ is less than the number $p$ of resolution levels wanted, return to
step 2.

The set $W = \{w_1, \ldots, w_p, c_p\}$, where $c_p$ is a residual, represents the
wavelet transform of the data. The filter $h$ is based on a $B_3$ spline: the
5 × 5 filter used and other implementation details such as treatment of
boundaries are given in Starck and Murtagh (1994). We have the following
additive decomposition of the input image:

$$c_0(k) = c_p(k) + \sum_{i=1}^{p} w_i(k) \qquad (1)$$
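
The four steps above can be sketched in one dimension. This is a simplified illustration rather than the implementation referred to in the text: it assumes the 1D $B_3$ spline kernel (1/16, 1/4, 3/8, 1/4, 1/16) instead of the 5 × 5 image filter, and it assumes mirror reflection at the boundaries. The function names and the test signal are hypothetical.

```python
def mirror(idx, n):
    """Reflect an out-of-range index back into [0, n-1] (assumed boundary rule)."""
    period = 2 * (n - 1)
    idx = abs(idx) % period
    return period - idx if idx >= n else idx

def a_trous(signal, levels):
    """Return ([w_1, ..., w_p], c_p) for a 1D signal."""
    h = (1 / 16, 1 / 4, 3 / 8, 1 / 4, 1 / 16)  # 1D B3 spline kernel
    n = len(signal)
    c = [float(v) for v in signal]
    planes = []
    for i in range(1, levels + 1):
        step = 2 ** (i - 1)  # spacing between filter taps: the "holes"
        smooth = [
            sum(w * c[mirror(k + t * step, n)] for t, w in zip((-2, -1, 0, 1, 2), h))
            for k in range(n)
        ]
        planes.append([a - b for a, b in zip(c, smooth)])  # w_i = c_{i-1} - c_i
        c = smooth
    return planes, c

# Hypothetical 16-sample signal with two bumps.
signal = [0, 0, 1, 5, 1, 0, 0, 0, 0, 2, 6, 9, 6, 2, 0, 0]
planes, residual = a_trous(signal, levels=3)

# Additive decomposition (1): residual plus all wavelet planes gives the input.
reconstructed = [residual[k] + sum(p[k] for p in planes) for k in range(len(signal))]
```

Because each plane is a difference of successive smoothings, the reconstruction is exact up to rounding, whatever filter or boundary rule is chosen.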



3 Poisson Noise with Few Counts 

A large number of counts would permit us to model the Poisson character- 
istics of the noise with a Gaussian distribution, and for the latter, Murtagh 
et al. (1995) may be referred to. The small count situation is of interest 
to us here. It holds for all parts of point pattern images and there is no 
alternative for the sparsely-populated regions. 

A wavelet coefficient at a given position and at a given scale $j$ is

$$w_j(x, y) = \sum_{k \in K} n_k \, \psi\left( \frac{x_k - x}{2^j}, \frac{y_k - y}{2^j} \right) \qquad (2)$$

where $K$ is the support of the wavelet function $\psi$ and $n_k$ is the number of
events which contribute to $w_j(x, y)$, i.e. the number of events included in
the support of the dilated wavelet centred in $(x, y)$.

If a wavelet coefficient $w_j(x, y)$ is due to noise, it can be considered as
a realization of the sum $\sum_{k \in K} n_k$ of independent random variables with the
same distribution as that of the wavelet function ($n_k$ being the number of
events used for the calculation of $w_j(x, y)$). This allows comparison of the
wavelet coefficients of the data with the values which can be taken by the
sum of $n$ independent variables.

The distribution of one event in wavelet space is then directly given by
the histogram $H_1$ of the wavelet $\psi$. As we consider independent events, the
distribution of a coefficient $w_n$ (note the changed subscripting for $w$, for
convenience) related to $n$ events is given by $n$ autoconvolutions of $H_1$:

$$H_n = H_1 \otimes H_1 \otimes \cdots \otimes H_1 \qquad (3)$$

For a large number of events, $H_n$ converges to a Gaussian.
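
The repeated autoconvolution in (3) is easy to illustrate on a discretized histogram. The sketch below is an assumption-laden toy: the three-bin histogram stands in for the histogram of a real wavelet function, and the function names are hypothetical.

```python
def convolve(p, q):
    """Discrete convolution of two histograms given as lists of bin masses."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def autoconvolve(h1, n):
    """Distribution of the sum of n independent draws from h1 (n >= 1)."""
    hn = list(h1)
    for _ in range(n - 1):
        hn = convolve(hn, h1)
    return hn

# Toy normalized histogram standing in for H_1 (assumed values).
h1 = [0.25, 0.5, 0.25]
h2 = autoconvolve(h1, 2)
h8 = autoconvolve(h1, 8)
```

As n grows, the shape of `autoconvolve(h1, n)` approaches a bell curve, which is the convergence to a Gaussian noted in the text. Detection thresholds for a coefficient built from n events can then be read off the tails of the corresponding histogram.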
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4 Multiresolution Support 

The multiresolution support (Starck et al., 1995) will be obtained by detecting
the significant coefficients at each scale. The multiresolution support is
defined by:

$$M(j, x, y) = \begin{cases} 1 & \text{if } w_j(x, y) \text{ is significant} \\ 0 & \text{if } w_j(x, y) \text{ is not significant} \end{cases} \qquad (4)$$

We will say that the multiresolution support of an image $I$ describes, in a
logical or boolean way, whether $I$ contains information at a given scale $j$ and
at a given position $(x, y)$. The algorithm to create the multiresolution support
is as follows:



1. Compute the wavelet transform of the image. 

2. Estimate the noise standard deviation at each scale. Deduce the sta- 
tistically significant level at each scale. 

3. Booleanization of each scale leads to the multiresolution support. 

4. Modification using a priori knowledge is carried out if desired. 
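
Steps 2 and 3 of this algorithm amount to thresholding each scale. A minimal sketch follows, assuming that a noise standard deviation per scale is already available and that a coefficient counts as significant when its magnitude reaches k times that value; the factor k = 3, the function name and the numbers are illustrative assumptions, and in the few-count Poisson case the thresholds would instead come from the histograms of Section 3.

```python
def multiresolution_support(planes, sigmas, k=3.0):
    """Booleanize each wavelet plane: 1 where |w| >= k * sigma_j, else 0."""
    return [
        [1 if abs(w) >= k * sigma else 0 for w in plane]
        for plane, sigma in zip(planes, sigmas)
    ]

# Two hypothetical 1D wavelet planes and assumed noise levels per scale.
planes = [[0.1, -2.5, 0.3, 4.0],
          [0.05, 0.2, -0.9, 1.1]]
sigmas = [1.0, 0.25]

support = multiresolution_support(planes, sigmas)
```

Contiguous runs of ones in a chosen plane of `support` are the candidate cluster regions.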



5 Example 

Fig. 1 shows two Gaussian clusters designed with centers (64, 64) and (190, 190) 
and with standard deviations in x and y directions respectively (10,20) and 
(18, 10). In the first (lower) of these clusters, there are 300 points, and there 
are 250 in the second. Background Poisson clutter was provided by 40,000 
points. Even with this amount of noise, the recovery properties of these 
clusters, based on the multiresolution support, were very good. We found 
the centroid values of the “island” regions to be respectively (61,70) and 
(187, 186) which are good fits to the design values. The standard deviations 
in x and y were found to be respectively (6.4, 8.1) and (7.2, 6.4), again
reasonable fits to the input data given that there are around eighty times more
“noise” points present compared to “signal” points.



6 High-Dimensional Data Spaces 

Information retrieval and related areas require solution methods for data in
spaces of dimensionality of many tens of thousands. Matrix reordering techniques
are closely related to eigen-reduction methods (Murtagh, 1985; Berry
et al., 1996). Matrix reordering can be used to transform high-dimensional
data arrays to a canonical form. On the basis of a data array in canonical
form, we search for structure using an efficient multiscale approach, namely
the same wavelet transform approach as above, which handles sparse data
effectively.

Figure 1: The image showing 550 “signal” points, i.e. two Gaussian-shaped
clusters, with in addition 40,000 Poisson noise points added. Details of
recovery of the cluster properties are discussed in the text.

We explore how this works in practice, using a dataset with very well-known
structure. This is booleanized to emulate a document/keyword or
other high-dimensional dependency array in the following way. We take
Fisher’s (and Anderson’s) iris data, which consist of a 150 × 4 real-valued
array, with 3 well-known classes. We booleanize this to create a binary
dataset. This is done by roughly multiplying each value by 100, truncating
and rounding each variable to 1 or 0. A 4-dimensional dataset is recoded in
this manner to a 147-dimensional dataset. Each observation has exactly four
1-values. The total number of 1-values is 600. The total number of values
in the array is 22050. We note again that the use of such an array is purely
to provide a test framework of known properties. To quickly check on the
change of information content in moving from a 4- to a 147-dimensional
embedding of the 150 observations, we can use principal components analysis
and correspondence analysis, respectively.
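
The recoding is described only briefly, so the following sketch rests on an assumption: each distinct (variable, value) pair becomes one boolean column, which guarantees exactly four 1-values per observation. The function name and the four sample rows are illustrative, and the resulting number of columns depends on the coding chosen, so it need not match the 147 reported above.

```python
def booleanize(rows):
    """One boolean column per distinct (variable index, value) pair (assumed coding)."""
    columns = sorted({(j, v) for row in rows for j, v in enumerate(row)})
    index = {col: pos for pos, col in enumerate(columns)}
    coded = []
    for row in rows:
        bits = [0] * len(columns)
        for j, v in enumerate(row):
            bits[index[(j, v)]] = 1
        coded.append(bits)
    return columns, coded

# A few illustrative four-variable rows in the style of the iris measurements.
rows = [(5.1, 3.5, 1.4, 0.2),
        (4.9, 3.0, 1.4, 0.2),
        (7.0, 3.2, 4.7, 1.4),
        (6.3, 3.3, 6.0, 2.5)]

columns, coded = booleanize(rows)
ones_per_row = [sum(bits) for bits in coded]
```

Every entry of `ones_per_row` is 4, and the total number of 1-values is four times the number of observations, in line with the 600 ones for 150 observations quoted in the text.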

With such a boolean array, we seek a canonical ordering of the rows and
columns. For the small dataset considered here, a convenient ordering is
that which reflects the ordering of the first factor in correspondence analysis.

Figure 2: Wavelet transform of Fisher data. Left to right, from upper left:
resolution levels 1, 2, 3, 4, 5 and the residual.

We wavelet-transform this row- and column-permuted array. The à trous
method is used. The wavelet transform is shown in Fig. 2. Resolution level 
5 is the most interesting output scale relative to the clustering - contiguous 
regions - indicated. We note that of necessity such a resolution level is of 
zero mean, which facilitates thresholding. 

We label the contiguous regions of the multiresolution support at resolu- 
tion level 5. The corresponding observation sequence numbers are read off, 
taking account of the permuting previously carried out. The findings are 
that Fisher’s class 1 is nearly completely recovered (this corresponds to the 
large positive region in the upper right-hand corner). This is of course in 
line with the known fact that this class is isolated from the others. Fisher 
classes 2 and 3 are less well resolved, in line with the correspondence anal- 
ysis principal plane obtained from analysis of the 147-dimensional boolean 
data. 

We conclude that this novel approach to clustering is reliable. For $n$
rows and $m$ columns, it has linear computational cost in terms of these
dimensions, and hence is $O(nm)$. The constant of proportionality is not
important. As noted before, the computational complexity in terms of the
actual array contents (one-values or non-boolean values) is constant, $O(1)$.
These computational results may be contrasted with other $O(n^2)$ methods,
or methods which are of better computational complexity but which lack
robustness and stability.

We are currently applying this new clustering method to the analysis of a
12,025 × 9,778 article-by-link matrix, relating to articles by cross-references
from the Condensed Columbia Encyclopedia (1989, 2nd ed., online version).
This boolean array has 33,396 links (nonzeros), i.e. 0.0284% density. Having
effective methods for the analysis of such data is important in practice since 
voluminous data of this kind arise naturally from communications networks, 
and from the entertainment and communications media. With the ability to 
determine clusters quickly and effectively, subsequent phases of the analysis 
may, for example, entail finer-tuned inferential modeling of the individual 
clusters. 
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Abstract: In the present paper, the conditions under which Simpson’s paradox does 
not occur are discussed for various cases. These conditions are first obtained from the
descriptive point of view and then on the assumption of prior probability distributions
of parameters. The robustness of the results is discussed with respect to the prior
probability distributions. Practically, the result is given as the magnitude of the odds ratio
(or relative risk), i.e., Simpson’s paradox does not occur if the odds ratio is greater or
less than a certain value, depending on the case.

Keywords: Simpson's paradox, conditions of non-paradox, descriptive approach, 
prior probability distribution of a parameter, robustness of solution, Monte Carlo 
solution. 



1. Introduction 

Simpson’s or Yule’s paradox (Simpson, 1951), treated in the present paper, is well
known and elementary but nonetheless very important, even in modern data analysis.
The paradox sometimes causes serious problems for interpretation of results. These
misleading effects have been illustrated in many epidemiological publications
(Miettinen, 1976; Kleinbaum, 1982). Simpson’s paradox has also been discussed by
other investigators; for example, Shapiro (1982), Geng (1992), Hayashi (1993),
Hintzman (1993), Hand (1994), and Vo^ (1995).

Although many methods have been proposed for adjusting the effects of confounding
factors in experimental design, most of these efforts center on statistical tests in the 
case of known confounding factors. However, we cannot ignore the effects of 
unknown confounding factors. What must be noticed is that the existence of latent 
classes (groups) having the reverse or independent association from a given 
contingency table yields Simpson’s paradox. Therefore, examining the condition under 
which such latent groups exist offers a key to understanding the occurrence of 
Simpson’s paradox. 

In the present paper, the conditions under which Simpson’s paradox does not occur 
are discussed. In order to examine these conditions, we tried to find conditions 
exhibiting the nonexistence of two latent groups with independent association, using 
two parameters relating to those latent groups, because independent association is 
regarded as the extreme case of inverse association. 
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2. Explanation by Simple Examples 

Suppose that a contingency table (Table 1a) is given in a group where A and B are
categories in a factor (we call this I), and + and - are categories in an outcome (we
call this II).



Table 1a: Example 1

Table 1b: Proportions for Example 1

From this table, we calculated the proportions of + and - by A or B (Table 1b). From
this analysis, we would usually conclude that outcome + in A is larger than outcome +
in B. Here we suppose that Table 1a is obtained as the sum of the contingency tables in
two latent groups (Table 2a), which are derived by the division or classification of the
original group. That is, the original contingency table is given by the
sum of the contingency tables in two latent groups, G1 and G2.



Table 2a: Frequency in two latent groups

G1-group
            +       -     Total
A        2800    1200      4000
B         800     200      1000
Total    3600    1400      5000

G2-group
            +       -     Total
A         200     800      1000
B        1200    2800      4000
Total    1400    3600      5000

From these tables (Table 2a), we calculated the proportions of + and - by A or B in G1
and G2 (Table 2b).

Table 2b: Proportions in each group (G1-group, G2-group)
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In these two groups, we find that outcome + in A is smaller than outcome + in B. That 
is, the inverse result has been obtained in both groups. A discrepancy between the total 
and the parts has appeared. 
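
The reversal can be reproduced directly from the frequencies of Table 2a. The sketch below uses exact fractions; the variable and function names are illustrative, and the pooled table is obtained by summing the two latent groups, as the text describes.

```python
from fractions import Fraction

# (count of +, count of -) for categories A and B, taken from Table 2a.
g1 = {"A": (2800, 1200), "B": (800, 200)}
g2 = {"A": (200, 800), "B": (1200, 2800)}

def positive_rate(pos, neg):
    """Proportion of outcome + within one row of a table."""
    return Fraction(pos, pos + neg)

# Original (pooled) table: cell-wise sum of the two latent groups.
pooled = {key: (g1[key][0] + g2[key][0], g1[key][1] + g2[key][1]) for key in g1}

rates_g1 = {key: positive_rate(*g1[key]) for key in g1}
rates_g2 = {key: positive_rate(*g2[key]) for key in g2}
rates_pooled = {key: positive_rate(*pooled[key]) for key in pooled}
```

Within each latent group the proportion of + is smaller for A than for B (7/10 against 4/5, and 1/5 against 3/10), yet in the pooled table it is larger for A (3/5 against 2/5).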

Here we consider, by a simple example, the condition under which such inconsistency
does not occur. Taking no association as the extreme case of inverse association, we
take up the following example (Table 3a).



Table 3a: Original contingency table

            +       -     Total
A          20    4980      5000
B          10    4990      5000
Total      30    9970     10000

Table 3b: Contingency table in G1-group: no association

            +            -      Total
A       4000S    4000(1-S)       4000
B       1000S    1000(1-S)       1000
Total   5000S    5000(1-S)       5000

Table 3c: Contingency table in G2-group: no association

            +            -      Total
A       1000T    1000(1-T)       1000
B       4000T    4000(1-T)       4000
Total   5000T    5000(1-T)       5000

Note: S and T are the proportions of outcome + by A and B in each group.



From these tables, the following relations hold:

4000 S + 1000 T = 20,
1000 S + 4000 T = 10.

From these equations, we have

S = 7/1500,

T = 2/1500.

In the total (original group), the proportion of outcome + is larger for A than for B,
while there is no association between I and II in either latent group. To generalize,
we consider 20 k0 (k0 ≥ 0) instead of 10 in Table 3a for the original group. For which
values of k0 does this inconsistency not occur? If the no-association case cannot
occur, still less can inverse association.

First, we calculate the conditions under which such an inconsistency does occur. From
Table 3b and Table 3c, we have the simultaneous linear equations

4000 S + 1000 T = 20,
1000 S + 4000 T = 20 k0,

and the solution 750 S = 4 - k0.

From 1 ≥ S ≥ 0 and k0 ≥ 0, the inequalities

750 ≥ 4 - k0 ≥ 0,

i.e. 4 ≥ k0 ≥ 0    (1)

are obtained. The other solution is

750 T = 4 k0 - 1.

From 1 ≥ T ≥ 0 and k0 ≥ 0,

750 ≥ 4 k0 - 1 ≥ 0,

i.e. 751/4 ≥ k0 ≥ 1/4    (2)

From (1) and (2), the range of k0 in which the inconsistency does occur is
4 ≥ k0 ≥ 1/4. Thus, if k0 > 4 (Kmax = 4) or k0 < 1/4 (Kmin = 1/4), no inconsistency
occurs, i.e., Simpson's paradox does not occur. This gives the basic idea of our
treatment of Simpson's paradox.
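
The range of k0 can be verified numerically. The sketch below, in Python with exact fractions, solves the two linear equations for a given k0 and tests whether S and T are admissible proportions; the function names are illustrative and not taken from the paper.

```python
from fractions import Fraction


def solve_S_T(k0):
    """Solve 4000 S + 1000 T = 20 and 1000 S + 4000 T = 20 k0 (Cramer's rule)."""
    k0 = Fraction(k0)
    det = 4000 * 4000 - 1000 * 1000
    S = (20 * 4000 - 1000 * 20 * k0) / det
    T = (4000 * 20 * k0 - 1000 * 20) / det
    return S, T


def inconsistency_possible(k0):
    """True when both S and T lie in [0, 1], i.e. a no-association split exists."""
    S, T = solve_S_T(k0)
    return 0 <= S <= 1 and 0 <= T <= 1


for k0 in [Fraction(1, 5), Fraction(1, 4), Fraction(1, 2), 4, 5]:
    print(k0, solve_S_T(k0), inconsistency_possible(k0))
```

With k0 = 1/2 this reproduces S = 7/1500 and T = 2/1500, and the admissible values of k0 are exactly those between 1/4 and 4.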



3. Descriptive Approach 

We generalize the example shown in Section 2. Before beginning the discussion, we 
would like to explain the notation. 

(i) Original group and two latent groups with classification based on unknown 
criteria 

The notation is described in Figure 1 and Figure 2.

(ii) Restriction of Parameters

The conditions that these parameters must satisfy are shown below.

(a) Fundamental conditions

1 ≥ P ≥ 0, 1 ≥ S ≥ 0 and 1 ≥ T ≥ 0.

(b) Restrictions among parameters from the fundamental conditions and Figure 2.

1 > d > 0, that is, 1 > (C - bg)/(1 - b) > 0

b > 0, g > 0, k0 > 0

1 > P > 0, 1 > S > 0, 1 > T > 0.

(c) Restrictions on cell size 

For practical use, the restrictions 0 < S < 1 and 0 < T < 1 are assumed, and we set
the least cell frequency L (≥ 1) such that

'N11, 'N12, 'N21, 'N22, "N11, "N12, "N21, "N22 ≥ L.



(iii) Result 

Under restrictions (a), (b), and (c), the values of ko were calculated for given values P, 
C, and N, and unknown parameters b and g with pre-assigned constant values. The 
results are shown in Yamaoka (1996). 



4. Solution by Monte Carlo Method 

Pre-assigned values for b and g were used in Section 3; that is, in the previous
study, the parameters b and g were varied at constant intervals. However, the values
of b and g are essentially unknown. We therefore assume that these parameters have
some prior probability distribution (hereinafter referred to as PPD) and examine how
the solutions change depending on the PPD. If the solutions vary notably with the PPD,
they must be discussed carefully. If, on the other hand, they prove to be robust to
the PPD, the solutions are of practical use. Examining the robustness of the solutions
to several PPDs is the subject of the present study. For simplicity, restriction (c)
is not used.

(i) Procedure 

The value k0 changes in accordance with changes in the parameters b and g, which
change in accordance with the decomposing condition. It should be noted that N11 and
the marginal distributions are fixed. To distinguish the value k0 calculated from the
original table from that obtained in the situation of Figure 2, we refer to the latter
as k. The value k has a minimum and a maximum under the conditions on the parameters
b and g, subject to some restrictions. Thus, the strength of the association can be
examined by comparing the given value k0 with the minimum value of k (Kmin) or the
maximum value of k (Kmax). In such a situation, Kmin is obtained when k < 1, and Kmax
is obtained when k > 1.

From (a) and (b) in Section 3 we have the restrictions bg < C, C < 1 - b + bg, and
0 < kP < 1 from the nature of a contingency table. In this analysis, we have the
following solutions from the formulas (1) and (2):



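when C > g:

\[
K_{\max} = \min\left\{ \frac{b(C-g) + CP(1-b+bg-C)}{P(C-bg)(1-C)},\ \frac{C(1-g)}{g(1-C)} \right\},
\]
\[
K_{\min} = \max\left\{ \frac{C(1-b+bg-C)}{(C-bg)(1-C)},\ \frac{g-C+CP(1-g)}{gP(1-C)} \right\};
\]

when g > C > bg:

\[
K_{\max} = \min\left\{ \frac{C(1-b+bg-C)}{(C-bg)(1-C)},\ \frac{g-C+CP(1-g)}{gP(1-C)} \right\},
\]
\[
K_{\min} = \max\left\{ \frac{b(C-g) + CP(1-b+bg-C)}{P(C-bg)(1-C)},\ \frac{C(1-g)}{g(1-C)} \right\}.
\]

(3)

In the case that C ≤ bg, we have no solution.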



To examine the robustness, the Monte Carlo method is a very useful strategy. In the
Monte Carlo method, we assume four types of PPD and generate random numbers. That is,
instead of varying b and g at constant intervals, we assume four types of
distribution, U1, U2, U3 and U4, shown below, for b and g, and calculate the medians
of Kmin and Kmax and the percentiles (Kmin)lo and (Kmax)up (the 5th percentile for
(Kmin)lo and the 95th percentile for (Kmax)up).

The four types of PPD correspond to the following distributions:

U1: uniform distribution on [0,1]

U2-U4: triangular distributions on [0,1]

U2: isosceles triangle

U3: right-angled triangle with mode 1

U4: right-angled triangle with mode 0.    (4)
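
The procedure can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' program: the bounds are obtained by requiring 0 ≤ S ≤ 1 and 0 ≤ T ≤ 1 in the no-association model, draws of (b, g) that violate bg < C < 1 - b + bg are simply discarded (the paper does not say how such draws were handled), and all function and key names are hypothetical.

```python
import random
import statistics


def k_bounds(P, C, b, g):
    """Return (Kmin, Kmax) for one decomposition, or None if it is inadmissible."""
    if not (0 < b < 1 and 0 < g < 1) or g == C:
        return None
    if not (b * g < C < 1 - b + b * g):  # equivalent to 0 < d < 1
        return None
    rest = 1 - b + b * g - C
    s0 = C * rest / ((C - b * g) * (1 - C))                        # boundary S = 0
    s1 = (b * (C - g) + C * P * rest) / (P * (C - b * g) * (1 - C))  # boundary S = 1
    t0 = C * (1 - g) / (g * (1 - C))                               # boundary T = 0
    t1 = (g - C + C * P * (1 - g)) / (g * P * (1 - C))             # boundary T = 1
    if C > g:
        k_min, k_max = max(s0, t1), min(s1, t0)
    else:
        k_min, k_max = max(s1, t0), min(s0, t1)
    return (k_min, k_max) if k_min <= k_max else None


# The four prior distributions on [0, 1]; random.triangular(low, high, mode).
SAMPLERS = {
    "U1": lambda rng: rng.random(),
    "U2": lambda rng: rng.triangular(0.0, 1.0, 0.5),
    "U3": lambda rng: rng.triangular(0.0, 1.0, 1.0),
    "U4": lambda rng: rng.triangular(0.0, 1.0, 0.0),
}


def simulate(P, C, dist_b, dist_g, n=20000, seed=0):
    """Medians of Kmin, Kmax and the 5th / 95th percentiles over admissible draws."""
    rng = random.Random(seed)
    k_mins, k_maxs = [], []
    for _ in range(n):
        bounds = k_bounds(P, C, SAMPLERS[dist_b](rng), SAMPLERS[dist_g](rng))
        if bounds is not None:
            k_mins.append(bounds[0])
            k_maxs.append(bounds[1])
    return {
        "median_Kmin": statistics.median(k_mins),
        "Kmin_lo": statistics.quantiles(k_mins, n=20)[0],    # 5th percentile
        "median_Kmax": statistics.median(k_maxs),
        "Kmax_up": statistics.quantiles(k_maxs, n=20)[-1],   # 95th percentile
    }


print(k_bounds(P=0.004, C=0.5, b=0.5, g=0.8))  # the example of Section 2
print(simulate(P=0.1, C=0.5, dist_b="U1", dist_g="U2"))
```

For the example of Section 2 (P = 0.004, C = 0.5, b = 0.5, g = 0.8), `k_bounds` gives values close to Kmin = 1/4 and Kmax = 4, up to floating-point rounding.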



(ii) Simulation Results 
(a) Robustness 

We have sixteen combinations of PPDs for the parameters b and g. For each, the
medians of Kmin and Kmax and the percentiles (Kmin)lo and (Kmax)up were calculated.
The random numbers were generated 20000 times for each.

Here, we concentrated our attention on the differences in the solutions among the PPD
types. As a whole, the differences between the medians among the PPD types (except
for four combinations and for (Kmax)up) were not very large; i.e., robustness was
generally confirmed. However, as the values of (Kmin)lo are scattered, this
information must be treated carefully. As an example, the solutions for the case
P = 0.1 are shown in Figure 3.

Figure 3. Medians and percentiles of (Kmin)lo and (Kmax)up with given values C as the
horizontal axis and the four distribution types as the corresponding lines,
for P = 0.1.

Thin line = medians, dotted line = percentiles.





(b) Case of Mixed Distribution 

Mixed distribution cases were calculated under the condition that b and g are
mutually independent and that each of the four PPDs occurs with equal probability
1/4. That is, b and g were generated using random numbers from the mixture of the
four types of PPD with probability 1/4 for each. The random numbers were generated
50000 times.

Most of the medians for Kmin varied around 0.5 (except for the case of small P), and
the percentiles (Kmin)lo varied around 0.05. On the other hand, for Kmax, the medians
varied around 2 and (Kmax)up varied around 8. The behavior of the solutions is stable
in this mixed case. As an example, the solutions for the case P = 0.1 are shown in
Figure 4.

Figure 4. Medians and percentiles of (Kmin)lo and (Kmax)up with given values C as the
horizontal axis for the mixed distribution, for P = 0.1.

Thin line = medians, broken line = percentiles.

(c) Discussion

The analysis gave an overall view of Kmin, Kmax, and C. These results suggest that
the boundary (for the median) of the occurrence of Simpson's paradox may lie around
k0 = 0.5 (or 2). However, in order to rule out the occurrence of Simpson's paradox
without regard to any restrictions on the condition of the classified subgroups, k0
should generally be smaller than approximately 0.05 or larger than 8.



5. Concluding Remarks 

The importance of the division (classification) of an original group by known or
unknown criteria has been shown, through the treatment of Simpson's paradox, in data
analysis from the exploratory point of view.

In the case of classification by known criteria, the problem is not too difficult,
even though the results are very interesting, whereas in the case of classification
by unknown criteria, i.e. the existence of latent groups (classes), the problem is
very complicated and generally quite difficult to solve. Such problems have serious
implications for data science. In the present paper, a fundamental problem called
Simpson's paradox has been treated and discussed under the concept of data science.

Abstract: A debate over the use of consensus methods for combining trees, as
opposed to the combination of data sets, is currently growing in the field of
phylogenetics. Resolution of this question is greatly influenced by the consensus
method employed (i.e., for unweighted or weighted trees). In the present paper, my
main objective is to review some of the methods that allow for the combination of
weighted trees. These consensus methods with branch lengths will be compared to some
of the commonly used consensus methods for n-trees. Finally, I will extend the
results to supertrees.

Key words: Consensus Trees, Weighted Trees, Phylogenies, Supertrees. 



1. Introduction 

To combine, or not to combine? This is one of the hottest debates in phylogenetics
today (de Queiroz et al., 1995; Huelsenbeck et al., 1996). On the one hand,
proponents of total evidence (i.e., character congruence) claim that all available data
must be combined prior to phylogenetic reconstruction (e.g., Kluge, 1989). On the
other hand, proponents of consensus (i.e., taxonomic congruence) prefer to combine
the phylogenies derived from independent data sets (e.g., Miyamoto and Fitch,
1995). Resolution of this dilemma greatly depends on the consensus method
selected, however, and how it fares against total evidence. In particular, it is striking
that the advocates of character congruence have used conservative techniques to
assess consensus (i.e., the strict consensus); when the consensus tree was less
resolved than, or different from, the total-evidence phylogeny, the latter approach
was declared superior (Barrett et al., 1991; de Queiroz, 1993). Thus, the important
question of what consensus method to choose remains central to this debate.
Recently, it was shown that when branch lengths are accounted for when combining
trees, total-evidence and consensus phylogenies do not differ much (Lapointe,
1997; Lapointe et al., submitted). In this paper, I will review the consensus
methods currently available to combine trees while taking into account branch
lengths. I will use examples to illustrate the differences between these methods and
some other consensus techniques that ignore branch lengths. I will present methods
to combine ultrametric trees. Then, the case of additive trees will be considered, and
a new consensus method based on spectral decomposition will be introduced.



2. Consensus methods for unweighted trees 

Let us consider a simple example involving three dendrograms derived from
different genes (Fig. 1). These phylogenies differ with respect to topological
relationships and path-lengths among the species considered. In order to reconcile
these results, two options are available: the first is to combine the data to obtain a
"total-evidence" phylogeny (Fig. 1d), whereas the other is to combine the trees
using some consensus method.



Fig. 1. Three ultrametric phylogenies derived from different genes: (a) 12S rRNA,
(b) IRBP, (c) von Willebrand's factor (see Kirsch et al., submitted). The
total-evidence tree (d) was obtained by combining all data in one phylogenetic analysis.

The choice of a consensus technique is again central to this problem as more
conservative methods will usually produce less resolved trees. Indeed, a strict
consensus (Sokal and Rohlf, 1981) of the three phylogenies is barely resolved (Fig.
2a), whereas the consensus obtained with the majority-rule method (Margush and
McMorris, 1981) is somewhat better (Fig. 2b). The Adams (1972) consensus tree
is even more resolved (Fig. 2c), but it contains clades (e.g., {Rabbit, Cat}) which
were not observed in any of the original trees (see Fig. 1).

Fig. 2. (a) Strict, (b) majority-rule, and (c) Adams consensus of the trees in Fig. 1.

In this example, all of the consensus trees convey less phylogenetic information
than the total-evidence tree (Fig. 1d); thus character congruence should be preferred
here to reconcile the results of otherwise inconsistent phylogenies (Fig. 1). But this
conclusion may not always be true. I will now show that consensus methods which
take into account branch lengths often produce trees which are closer to
total-evidence phylogenies than those produced by consensus methods that ignore these
branch lengths.



2. Consensus methods for ultrametric trees 

Some of the earlier methods proposed to compute consensus with branch lengths
were simple extensions of consensus methods for unweighted trees. These
techniques were also restricted to ultrametric trees (i.e., dendrograms). Three
distinct methods will be presented here: the intersection method of Stinebrickner
(1984b), the Euclidean consensus method of Lefkovitch (1985), and the average
procedure of Cucumel (1990), as modified by Lapointe and Cucumel (1997).

2.1. Intersection method 

The method proposed by Stinebrickner (1984b) to combine dendrograms is a
modification of intersection methods for n-trees (Adams, 1972; Neumann, 1983;
Stinebrickner, 1984a). In short, the consensus tree is obtained by taking the
maximum ultrametric distance observed over all dendrograms, for every pair of
objects; the resulting pairwise distance matrix is also ultrametric and uniquely
represents the consensus dendrogram. In the present example, the consensus
obtained with Stinebrickner's method (Fig. 3a) is much more resolved than any of
the consensus trees constructed with topological techniques (see Fig. 2).
Nevertheless, that tree still differs considerably from the total-evidence phylogeny
(Fig. 1d). Also, this method can produce clades which were not encountered in any
of the individual phylogenies (e.g., {cat, human}).
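
The pairwise-maximum construction described above is easy to sketch in Python. The two small distance matrices below are hypothetical (they are not the trees of Fig. 1), and the function names are illustrative.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-12):
    """Check d[i][k] <= max(d[i][j], d[j][k]) for all triples of distinct objects."""
    n = len(d)
    return all(
        d[i][k] <= max(d[i][j], d[j][k]) + tol
        for i, j, k in permutations(range(n), 3)
    )


def intersection_consensus(matrices):
    """Entrywise maximum of the ultrametric distance matrices of the profile."""
    n = len(matrices[0])
    return [[max(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]


# Hypothetical dendrograms on three objects a, b, c (indices 0, 1, 2).
u1 = [[0, 1, 3], [1, 0, 3], [3, 3, 0]]  # ((a,b),c)
u2 = [[0, 4, 2], [4, 0, 4], [2, 4, 0]]  # ((a,c),b)

consensus = intersection_consensus([u1, u2])
print(consensus, is_ultrametric(consensus))
```

In this toy case the consensus groups a with c at height 3 before joining b at height 4, and the resulting matrix is again ultrametric.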
Fig. 3. (a) Intersection (Stinebrickner, 1984b), (b) Euclidean (Lefkovitch, 1985)
and (c) average consensus (Lapointe and Cucumel, 1997) of the trees in Fig. 1.

2.2. Euclidean method

The consensus method proposed by Lefkovitch (1985) is derived from his work on
principal-coordinates representations of dendrograms (Lefkovitch, 1976; 1978). It
uses the Euclidean representation of ultrametric distances, from which consensus
coordinates can be computed. The consensus obtained in this fashion is the
dendrogram that minimizes the sum of the squared distances from the objects in
each of the separate representations to their positions in the consensus.
Interestingly, one can obtain consensus trees containing overlapping groups with
such an approach (see Lefkovitch, 1985). In the present example, the Euclidean
consensus is a strictly bifurcating dendrogram (Fig. 3b). However, it is quite
different from the consensus tree obtained with Stinebrickner's method (Fig. 3a).

2.3. Average method 

This method, initially introduced by Cucumel (1990; see also Vichi, 1993; Lapointe
and Cucumel, 1997), is related to Lefkovitch's Euclidean consensus. In the case of
dendrograms, the average consensus is the ultrametric tree that minimizes the sum
of the squared distances from the trees in the input profile to the output consensus
solution. That tree is obtained by applying a least-squares algorithm to a matrix of
average path-length distances, computed over all dendrograms combined. In the
present example, the average consensus tree (Fig. 3c) is the closest to the
total-evidence tree (Fig. 1d), with only one different node (i.e., {Bat, Cat}).



3. Consensus methods for additive trees 

When one is dealing with phylogenies, imposing the ultrametric constraint amounts to
assuming that every lineage is evolving at the same rate (i.e., the molecular clock
hypothesis is satisfied; Fig. 1). In general, additive trees satisfying the four-point
condition are preferred to obtain a better representation of the data matrix (see Fig.
4). Most of the consensus methods for weighted trees only apply to dendrograms,
however. In the following sections, I will present two methods to obtain consensus
phylogenies from additive trees in general.



3.1. Average method 

As for ultrametric trees, the average procedure can be used to construct consensus
additive trees (see Lapointe and Cucumel, 1997). The procedure is basically the
same and consists of applying a least-squares algorithm for additive trees to a
matrix of average path-length distances, computed over all trees combined. When
such a method is used to combine our three additive phylogenies (Fig. 4), the
average consensus obtained (Fig. 4e) is much closer to the corresponding
total-evidence phylogeny (Fig. 4d) than what was obtained with the ultrametric
procedure (Fig. 1d vs. Fig. 3c).



Fig. 4. (a-c) Three additive phylogenies derived from different genes (see Fig. 1).
The total-evidence tree (d) was obtained by combining all data in one phylogenetic
analysis. The average consensus tree (e) was obtained by combining all trees.

3.2. Spectral method

When a least-squares algorithm is used to build a consensus from average path
lengths, internal and terminal branches of the trees combined are not equally
weighted (Lapointe et al., 1994). The spectral decomposition method (Hendy and
Penny, 1993) can be applied to circumvent this problem, however. The procedure
(see Hendy, 1991) is to find the tree spectrum that minimizes the sum of the
squared distances from a given data spectrum. In a similar fashion, a spectral
consensus can be derived by finding the closest tree to an average spectrum
representing different tree spectra. In the present example, the consensus computed
with the spectral decomposition method (not shown) is topologically identical to
the average consensus tree (Fig. 4e), with small differences in branch lengths. This
interesting result can be explained by the very short internodes of the trees
considered (Fig. 4), which reduce the gap between the spectral and average
procedures.



4. Consensus supertrees, with branch lengths 

While some supertree methods have been proposed to combine unweighted trees
(Gordon, 1986), there has been less effort devoted to the combination of weighted
trees containing overlapping sets of leaves. Brossier (1990) has been a pioneer in
this field with his piecewise hierarchical procedure for dendrograms. The average
consensus procedure was also applied to overlapping trees (e.g., Lapointe et al.,
1994), using the tree properties to fill the empty part of the average matrix (Landry
and Lapointe, 1997). It is not yet clear whether such a procedure could be extended
to the spectral consensus.

Summary: We present a survey of the literature on the consensus of classification

1. Classification trees

1.1. Classification trees. Among the various types of trees (connected and acyclic
graphs) found in the literature, the trees used in classification have two common
properties. In a classification tree (CT) T = (V, E), where, as usual, V is the finite
vertex set of T and the edge set E of T is a set of unordered pairs of V: (i) no
vertex, except possibly a unique distinguished one, not a leaf, has two neighbors;
and (ii) T is leaf-labelled, that is, the labels of the set X of the leaves of T are
the only relevant ones.

When it exists, the distinguished vertex of Condition (i) above is the root of T and
T is said to be rooted (and unrooted otherwise). The tree T may be valued, that is,
endowed with an edge length (a positive real function on E); otherwise, it is
unvalued. Hereunder, the length of the unique path between two vertices v and v'
is denoted as t(vv').

Each usual type of CT admits several other equivalent definitions. Every consensus
approach is related to a particular description of the CTs involved.

1.2. Hierarchies. A hierarchy H on X is a rooted and unvalued CT. It is well
known that, equivalently, H is a set of subsets (clusters) of X satisfying the
conditions: X ∈ H; ∅ ∉ H; for all x ∈ X, {x} ∈ H; for A, B ∈ H, A ∩ B ∈ {∅, A, B}.
Then, for any A ⊆ X, there is a minimum cluster A_H in H containing A.

Two other equivalent definitions of hierarchies appear in consensus studies: 

A hierarchy may be defined as the set of the triples xy/z of elements of X such
that {x,y}_H ⊂ {x,y,z}_H. Such sets of triples have been characterized by Colonius
and Schultze (81). More generally, a nesting (Adams 86) is a pair (A,B) of subsets
of X such that A_H ⊂ B_H, and a hierarchy is defined by the set of all its nestings.

1.3. Dendrograms. Dendrograms correspond to rooted and valued CTs, with
the further assumption that the length of any path from the root to a leaf is constant.
It is well known (Johnson 67) that a dendrogram is equivalent to an ultrametric
function, that is, a real non-negative function u on X² satisfying u(x,x) = 0,
u(x,y) = u(y,x), and u(x,z) ≤ max(u(x,y), u(y,z)), for all x, y, z ∈ X. The
correspondence between a dendrogram and its ultrametric u is given by
2u(x,y) = t(xy). Again, some other equivalent structures are useful:

• An indexed hierarchy (Benzecri 67) is a pair (H,i), where H is a hierarchy on X
and i is the real height function on H defined by i(A) = max{u(x,y): x, y ∈ A}.

• A residual mapping from R⁺ to the set of the partitions π of X: the converse image
of the set of all partitions coarser than π has a minimum (Janowitz 78, Leclerc 91).

1.4. Ranked hierarchies. Ranked hierarchies are intermediate between hierarchies
and indexed hierarchies. A ranked hierarchy is endowed with a complete preorder on
its clusters that is compatible with inclusion. Equivalent structures are sets of
pairwise nested partitions (including the coarsest, {X}, and the finest,
{{x}: x ∈ X}, ones), and ultrametric dissimilarity preorders, that is, preorders on
X² which are complete and satisfy xx ≤ xy, xx ≈ yy, and xz ≤ xy or xz ≤ yz, for all
x, y, z ∈ X (Barthelemy et al. 84, 86).

1.5. Phylogenetic trees. Now we consider unrooted CTs on X, where
phylogenetic trees (abbreviated here as PTs) correspond to the unvalued case.
Equivalent structures are:

• sets of pairwise compatible splits (bipartitions) (A,A') of X. Two splits (A,A')
and (B,B') are compatible when one of the sets A∩B, A∩B', A'∩B, A'∩B' is
empty (Buneman 71); these splits correspond to the edges of the tree;

• sets of quadruples xy/zw of elements of X satisfying some conditions given by
Colonius and Schultze (81).
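
The compatibility test for splits can be sketched in Python as follows. The representation of a split as a pair of frozensets and the function names are illustrative assumptions.

```python
from itertools import combinations


def compatible(split1, split2):
    """Two splits are compatible when one of the four intersections is empty."""
    return any(not (p & q) for p in split1 for q in split2)


def pairwise_compatible(splits):
    """True when every pair of splits in the collection is compatible."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))


def make_split(part, ground):
    """Build the split (A, A') of the ground set from one of its sides."""
    part = frozenset(part)
    return (part, frozenset(ground) - part)


X = {1, 2, 3, 4, 5}
s1 = make_split({1, 2}, X)
s2 = make_split({1, 2, 3}, X)
s3 = make_split({1, 3}, X)

print(compatible(s1, s2))               # {1,2} and {4,5} are disjoint
print(compatible(s1, s3))               # all four intersections are non-empty
print(pairwise_compatible([s1, s2]))
```

Here s1 and s2 can label edges of the same tree, whereas s1 and s3 cannot.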

Consensus methods on hierarchies and PTs are frequently closely related by the
following one-to-one correspondence (Day 85). Let a ∉ X; a PT T on X + {a} is
obtained by adding the edge ar to a hierarchy H with root r. The split
(A, (X−A) + {a}) of T comes from the cluster A of H, and the quadruple xy/za of T
from the triple xy/z of H (so, the entire set of quadruples is redundant to define
the tree T).

1.6. Tree metrics. An unrooted and valued CT is associated with a tree metric,
that is, a dissimilarity satisfying the four-point condition: for all x, y, z, w ∈ X,
t(x,y) + t(z,w) ≤ max(t(x,z) + t(y,w), t(x,w) + t(y,z)). This is a well-known
characteristic property of the path length functions t associated with these trees
(Buneman 71). Equivalently, unrooted and valued CTs may be defined as weighted sets
of pairwise compatible splits.
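
A direct check of the four-point condition can be sketched in Python. For four distinct objects, requiring the inequality under every relabelling is the same as requiring that the two largest of the three pairwise sums be equal, which is what the hypothetical function below tests; the two small matrices are made-up examples.

```python
from itertools import combinations


def satisfies_four_point(t, tol=1e-9):
    """Check the four-point condition on all quadruples of distinct objects."""
    n = len(t)
    for x, y, z, w in combinations(range(n), 4):
        sums = sorted(
            [t[x][y] + t[z][w], t[x][z] + t[y][w], t[x][w] + t[y][z]]
        )
        if sums[2] - sums[1] > tol:  # the two largest sums must coincide
            return False
    return True


# Path lengths of the tree ((0,1),(2,3)): pendant edges 1, internal edge 2.
tree = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]

# Shortest-path metric of a 4-cycle 0-1-2-3-0 with unit edges: not a tree metric.
cycle = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]

print(satisfies_four_point(tree), satisfies_four_point(cycle))
```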



2. The consensus problem 

2.1. Aims and approaches. The consensus problem consists of aggregating
several objects into a unique object of the same type. In the case of classification
trees, the most frequently mentioned situation is that of a clustering method applied to
several data sets, all describing the same objects, and giving different CTs. In
another situation, several clustering methods are applied to the same data set, each
giving a CT. The purpose may be either to recover a hypothetical "true" tree (as in
phylogenetic reconstruction), or to find a tree that best summarizes the given set of
trees.

Consider a set 𝒯 of CTs. A profile of length k of 𝒯 is a k-tuple T* = (T_1, ..., T_k)
of elements of 𝒯; we set K = {1, ..., k}. Let 𝒯* be the set ∪_k 𝒯^k of all the
profiles of finite length of 𝒯. A k-procedure is a map from 𝒯^k to 𝒯. A complete
procedure is a map from 𝒯* to 𝒯. A complete multiprocedure is a map from 𝒯* to
2^𝒯 − {∅}.

Three overlapping approaches have been used to tackle the consensus problem: 

• The constructive approach: a way to construct a consensus is explicitly given.
Obvious constructions (such as the unanimity, or strict consensus, rule) are generally
not adequate and more sophisticated constructive rules are needed.

• The combinatorial optimization approach: a criterion measuring the remoteness
of any CT T to the given profile T* is defined; we search for the CTs T ∈ 𝒯 that
minimize this criterion.

• The axiomatic approach: "to sit in an armchair and think of desirable properties
that a consensus method should possess, and then attempt to find the methods
satisfying these properties" (McMorris 85). This approach frequently leads to
problems of existence and uniqueness.

In the case of valued trees, e.g. dendrograms, two strategies are encountered.
Instead of a direct study of the consensus of dendrograms, it is possible to consider
the corresponding indexed hierarchies, then strip the heights to obtain hierarchies
and use a consensus method defined on hierarchies. A final (possible) step may be to
determine a height function for the consensus hierarchy (or hierarchies) obtained.
Such an approach is more transparent when one focuses attention on clusters.

2.2. A corpus. Starting from 1972 (Adams), we have found more than 90
references, not all given here for lack of space, to papers partly or totally
devoted to consensus tree methods. Among them are mathematical papers on
consensus where cases of CTs are explicitly referred to (e.g., Barthelemy et al. 84,
86, Monjardet 90, Barthelemy and Janowitz 91) or surveys on topics including CTs
where the consensus problem is explicitly referred to (Arabie and Hubert 96,
Gordon 87, 96, Leclerc and Cucumel 87, Lapointe 97). Most of these papers consider
hierarchies, whereas few works concern unrooted phylogenetic trees or tree metrics.
Papers dealing with related topics (consensus of other types of classifications such as
partitions or weak hierarchies, formal theory of consensus, comparison of two or
more CTs and consensus indices) are quoted here only when needed.

3. Consensus methods and uses on classification trees 

3.1. Quota rules and federation consensus functions. The q quota rule
consists of retaining the features present in at least q elements of the profile. Quota
rules are constructive k-procedures (complete procedures when replacing q with a
percentage). For special values of q, one gets classical consensus rules: the
unanimity rule, or strict consensus, for q = k, and the majority rule when q is
equal to p, the least integer greater than half the length k of the profile.

When trees are described as subsets (of clusters, of splits...), the q quota rule tree is
equal to T = ∪_{card(I) ≥ q} ∩_{i∈I} T_i, provided that this set corresponds to a tree.
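
For hierarchies described as sets of clusters, the quota rule amounts to counting in how many trees of the profile each cluster occurs. The following Python sketch uses a made-up profile of three hierarchies on X = {a, b, c, d}; the function names are illustrative.

```python
from collections import Counter


def quota_rule(profile, q):
    """Clusters present in at least q of the trees (each tree is a set of clusters)."""
    counts = Counter(cluster for tree in profile for cluster in tree)
    return {cluster for cluster, n in counts.items() if n >= q}


def is_hierarchy(clusters):
    """Pairwise test: any two clusters are disjoint or nested."""
    return all(
        a.isdisjoint(b) or a <= b or b <= a for a in clusters for b in clusters
    )


X = "abcd"
trivial = {frozenset(X)} | {frozenset(x) for x in X}
H1 = trivial | {frozenset("ab"), frozenset("abc")}
H2 = trivial | {frozenset("ab"), frozenset("abd")}
H3 = trivial | {frozenset("bc"), frozenset("abc")}
profile = [H1, H2, H3]

k = len(profile)
p = k // 2 + 1                       # least integer greater than k / 2
strict = quota_rule(profile, k)      # unanimity rule
majority = quota_rule(profile, p)    # majority rule
loose = quota_rule(profile, 1)       # every cluster seen at least once

print(is_hierarchy(strict), is_hierarchy(majority), is_hierarchy(loose))
```

In this toy profile the strict consensus keeps only the trivial clusters, the majority rule adds {a,b} and {a,b,c}, and the quota q = 1 yields overlapping clusters, so the retained set no longer corresponds to a tree.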

In the case of a set 𝒯 of CTs endowed with a lattice, or a semilattice, structure
(with a meet ∧ and a join ∨), lattice quota rules may be defined by the similar
formula:

T = ∨_{card(I) ≥ q} ∧_{i∈I} T_i

with, as above, algebraic unanimity (q = k) and majority (q = p) rules. A further
generalization is constituted by the federation consensus functions (FCFs).

Let ℱ be a family of subsets of K; define a consensus function F_ℱ by
T = F_ℱ(T*) = ∨_{I∈ℱ} (∧_{i∈I} T_i). Oligarchic consensus functions T = ∧_{i∈I} T_i
correspond to ℱ = {I}, I ⊆ K, dual oligarchic T = ∨_{i∈I} T_i to ℱ = {{i}: i ∈ I}
for a fixed subset I of K, and dictatorial consensus functions T = T_i to ℱ = {{i}}
for a fixed i ∈ K.

These mathematical formalizations are very efficient in the search for optimization
and axiomatic properties, especially for the majority rule. A basic result of formal
consensus theory is that, if the tree structure is a matter of pairwise compatibility,
then the ordered set (𝒯, ⊆) is a median semilattice and the set T corresponds to a
tree as soon as q ≥ p. Moreover, the quota rule is axiomatically characterized and
the majority rule element is the solution of an optimization problem (Monjardet 80,
Bandelt and Hedlikova 83, Bandelt and Barthelemy 84, Barthelemy and Janowitz
91, McMorris and Powers 95, McMorris et al. 96; see Sections 3.3 and 3.4). Formal
axiomatic results about FCFs are given in Monjardet (90; see Section 3.4).

Their degree of abstraction may make such consensus rules appear somewhat formal. 
In fact, they are effectively computable in many cases (Day 83, 85, Day and Wells 84). 

Strict consensus. Applied to hierarchies as sets of clusters, it is the early consensus 
method for CTs (as in other domains - see Day 86); it is still the most widely used 
one (Sokal and Rohlf 81, Rohlf 82, Wilkinson 94), sometimes with the addition of 
other selected compatible clusters (Nelson 79, Bremer 90). 

The Adams consensus method may be also described as a unanimity rule on 
hierarchies as sets of nestings (Adams 86). Axiomatic results about this approach, 
and the unanimity rule on hierarchies as sets of triples are given in Vach (94). 
Majority rule. It applies in the cases of median semilattice structures: hierarchies as 
sets of clusters (Margush and McMorris 81, Day 83, Barthelemy et al. 84, 86, 
Barthelemy and McMorris 86, Barthelemy and Janowitz 91, Leclerc 93, 94; see 
Sections 3.4 and 3.5), ranked hierarchies as sets of partitions (Barthelemy et al. 84, 
86), phylogenetic trees as sets of splits (Barthelemy et al. 86, Barthelemy and 
Janowitz 91). Margush and McMorris introduced it as an alternative to the too strong 
strict consensus, which frequently gives an almost trivial hierarchy. 

Other values of q have been considered for hierarchies as sets of clusters 
(McMorris et al. 83, Barthelemy 88 with an axiomatic characterization, Barthelemy 
and Janowitz 91) and phylogenetic trees as sets of splits (Barthelemy and Janowitz 
91). A variant may be found in Wilkinson (96; see Section 3.5). 

Federation consensus functions. They have been considered for hierarchies as sets
of clusters (McMorris and Neumann 83). In this case, the elements of ℱ must
pairwise intersect to ensure that T is a hierarchy. Stinebrickner (84b) gives axiomatic
properties of the oligarchic functions on dendrograms and, then, hierarchies by
introduction of a height function (in relation with the intersection methods of the next
section). In the study, in the same direction, of Neumann and Norton (86), the dual
oligarchic rule is also considered. An axiomatization of the dual oligarchic rules (and
of rules belonging to a wider class) on ultrametrics is obtained in Leclerc 84 (see also
Barthelemy et al. 84, 86, Leclerc 91 and Leclerc and Monjardet 95).

3.2. Intersection methods. As observed by several authors, the description of 
hierarchies as sets of clusters may lose some common information, recoverable by 
methods able to produce clusters present in no element of the profile. Intersection 
methods are constructive complete procedures on hierarchies where any cluster $C \in H$ 
is obtained by intersection of clusters $C_i \in H_i$. The intersections should be 
made selectively, since the collection of all possible intersections does not generally 
correspond to a hierarchy. 

In the celebrated Adams method (Adams 72), the $C_i$'s are at the top of the not yet 
intersected clusters of each hierarchy $H_i$. Equivalent definitions and axiomatic 
characterizations are given in Adams (86), Vach and Degens (88) and Vach (94). 
Neumann (83) defines generalized intersection rules, where the choice of the $C_i$'s 
depends on the introduction of a height function $r$ on the $H_i$'s (then becoming 
dendrograms): the intersected $C_i$'s are those of common height in each $H_i$ (that is, 
a unanimity rule for dendrograms is applied). For instance, the Durchschnitt or 
top-down height $r(C_i)$ is the number of edges of a path from the root to $C_i$, the 
bottom-up height is the maximum length of a path from $C_i$ to a leaf $x \in C_i$, and the 
cardinality height is the number of elements of $C_i$. Axiomatic results are obtained in 
Neumann (83), Neumann and Norton (86) and Powers (95). In particular, an interesting 
betweenness axiom is satisfied by the Durchschnitt intersection rule (Neumann 
83): for each $C_i \in H_i$ there is $C \in H$ such that $\bigcap_i C_i \subseteq C \subseteq \bigcup_i C_i$. Such 
results may help to justify the choice, arbitrary at first sight, of the height function. 

A variant ($s$-consensus methods) is proposed by Stinebrickner (84a): an index $h$ 
measures the adequacy of each cluster obtained by intersection and the clusters with 
$h$ greater than a fixed level $s$ (for instance $h(C) = \mathrm{card}\, C / \mathrm{card}(\bigcup_i C_i)$) are 
selected. In the case of the cardinality height function, such a method is intermediate 
between pure intersection ($s = 0$) and strict consensus ($s = 1$). Further formalization 
and developments are found in Day and McMorris (85) and Stinebrickner (86). 

3.3. The median procedure. In the case of CTs as elsewhere, the most 
frequently considered consensus method based on optimization is the median 
procedure. Consider a distance function $d$ on $\mathcal{T}$. The problem is to minimize the 
remoteness function $\rho(T, T^*) = \sum_{i \in K} d(T, T_i)$, subject to $T \in \mathcal{T}$. This is a complete 
multiprocedure, since there may exist several solutions, called medians. The median 
procedure applies to all metric spaces (Frechet 49; see the survey of Barthelemy and 
Monjardet 81 with its 88 updating where cases of CTs are considered). 

For CTs, the median procedure was introduced by Margush and McMorris (81), in 
the case of hierarchies as sets of clusters endowed with the symmetric difference 
metric. It is shown in this paper that the majority rule hierarchy is a median, a fact 
proved afterwards in all median semilattice structures (see Section 3.1). The algebraic 
characterization of median hierarchies is completed in Barthelemy et al. (84, 86) among 
similar cases; an axiomatic characterization of the median procedure is given in 
Barthelemy and McMorris (86), leading to the formal studies of Barthelemy and Janowitz 
(91) and McMorris and Powers (95). A probabilistic interpretation of the median 
procedure in the spirit of Condorcet, as read by Young (88), is given in McMorris 
(90). It is pointed out in Leclerc (93) that the medians remain the same for a wide 
variety of metrics. These results apply directly on phylogenetic trees as sets of splits, 
and also on ranked hierarchies as sets of partitions (Barthelemy et al. 84, 86). 
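
The remoteness criterion can be sketched as follows, assuming hierarchies stored as sets of clusters and the symmetric difference metric mentioned above. The three-hierarchy profile and all names are hypothetical; the sketch only compares a few candidate trees and does not search the whole space of hierarchies.

```python
def fs(*items):
    return frozenset(items)


def symmetric_difference_distance(h1, h2):
    """Number of clusters belonging to exactly one of the two hierarchies."""
    return len(h1 ^ h2)


def remoteness(candidate, profile):
    """Sum of the distances from the candidate to each tree of the profile."""
    return sum(symmetric_difference_distance(candidate, h) for h in profile)


def majority_rule(profile):
    """Clusters occurring in more than half of the hierarchies."""
    clusters = set().union(*profile)
    return {c for c in clusters
            if 2 * sum(c in h for h in profile) > len(profile)}


# Hypothetical profile on X = {a, b, c, d}.
X = fs("a", "b", "c", "d")
TRIVIAL = {X} | {fs(x) for x in X}
H1 = TRIVIAL | {fs("a", "b"), fs("a", "b", "c")}
H2 = TRIVIAL | {fs("a", "b"), fs("a", "b", "d")}
H3 = TRIVIAL | {fs("b", "c"), fs("a", "b", "c")}
PROFILE = [H1, H2, H3]

MAJORITY = majority_rule(PROFILE)
for name, tree in [("majority", MAJORITY), ("H2", H2), ("H3", H3),
                   ("trivial", TRIVIAL)]:
    print(name, remoteness(tree, PROFILE))
```

Among the candidates listed, the majority rule hierarchy has the smallest remoteness, in line with the result of Margush and McMorris (81) quoted above.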

When a median semilattice structure is no longer available, the median procedure 
generally leads to NP-hardness results, as for binary hierarchies (McMorris and Steel 
94), or ultrametrics endowed with the classical $L_1$ or $L_2$ distances, by reduction to 
partitions (see, e.g., Barthelemy and Leclerc 95). Such situations lead to interesting 
algorithmic studies, as in the unusual approach of Phillips and Warnow (96), replacing 
symmetric differences with set differences (on phylogenetic trees as sets of splits). 

As a special instance of a family of solvable monotonic regression problems, Chepoi 
and Fichet (97) give the optimal solution for the consensus of ultrametrics with the 
criterion $\rho'(u, u^*) = \max_{i \in K} d(u, u_i)$, where $d$ is the $L_\infty$ distance. 
3.4. Arrowian axiomatics. Many axiomatic studies are based on a main axiom 
(independence, stability, ...) of the type introduced by Arrow (51) in the domain of 
social choice: if two profiles T* and T'* agree on their restrictions to a subset S, 
then the consensus trees T and T' also agree on S. A stronger neutral monotonicity 
axiom states that: if the profile T* dominates (in some ordinal sense) the profile T'* 
on S, then T dominates T' on S too. Such studies apply generally to k-procedures. 
A first remark is that there are generally various ways of setting an 
axiom of the above type; for instance, for hierarchies, one may consider restrictions 
of trees to a subset S of X (stability axioms), or restrictions of hierarchies as sets of 
clusters to a subset S of the power set of X (decisiveness axioms). 

The results obtained in a formal theory of Arrowian consensus in semilattices 
(Barthelemy et al. 86, Monjardet 90, Leclerc and Monjardet 95) apply to hierarchies as 
sets of clusters (characterization of federation consensus functions using a neutral 
monotonicity axiom by McMorris and Neumann 83) and ultrametrics (characterization of 
generalized oligarchic consensus functions using a decisiveness axiom by Leclerc 84). 
Otherwise, the results often (but not always) concern dictatorial consensus functions: 

Hierarchies. Characterization of dictatorial consensus functions using various 
stability axioms (Neumann 83, Barthelemy and McMorris 89, Barthelemy et al. 91); 
axiomatic studies on dictatorial consensus functions (Barthelemy et al. 92); study of various 
stability conditions and the related consensus functions (Barthelemy et al. 95). 

Phylogenetic trees. Characterization of dictatorial consensus functions using 
stability axioms (McMorris 85, McMorris and Powers 93). 

3.5. Pruning, reduction and supertrees. Many recent works concern pruning 
or supertree approaches. In pruning, one gives up obtaining a consensus tree on the 
entire set X and seeks a maximal subset Y of X such that all the $T_i$'s agree on Y. So, 
the consensus tree is a CT on Y. The idea, initiated by Gordon (79) for ranked 
hierarchies, has been mainly studied for hierarchies (Gordon 80, 81; Finden and Gordon 85, 
with a regrafting procedure, thus leading to a hierarchy on X; Kubicka et al. 92, 95; 
Goddard et al. 95). In the "reduced" (Adams, majority rule) methods of Wilkinson 
(94, 96), maximum subsets corresponding with a consensus tree are also obtained. 

A converse approach is the search for consensus supertrees for trees defined on 
different overlapping sets (Gordon 86, Steel 92, Lanyon 93, Constantinescu and 
Sankoff 95, Ronquist 96). As noted by Steel, this problem is related to the 
fundamental cladistic compatibility one. Considerations on the relations between 
genes and species lead to another kind of supertrees (Page and Charleston 97). 

3.6. Other constructive approaches. Baum (92) proposes to use a binary 
encoding (Farris et al. 70) of the hierarchies $H_i$ and to reach a consensus tree by addition of 
the encodings and parsimony techniques. Several authors have completed or 
improved this method (Ragan 92, Rodrigo 93, Baum and Ragan 93, Purvis 95a). 
Lefkovitch (85) merges ultrametrics in a Euclidean space and applies principal 
component analysis to find a "mean ultrametric". This approach is further developed 
and discussed in the work on average ultrametrics of Lapointe and Cucumel (97). 

4. Final remarks 

In the phylogeny reconstruction domain, there is a controversy about the relevance of 
consensus trees (Anderberg and Tehler 90, Barrett et al. 91, 93, de Queiroz 93, 95, 
Nelson 93), the alternative being to combine the data sets first, before the construction 
of a tree. From a formal point of view, the question is perhaps: what step of the 
process is the best to apply a consensus method? Indeed, several authors have made 
advances, frequently based on asymptotic statistical distributions, in the problem of 
the validation of consensus trees (Shao and Rohlf 83, Shao and Sokal 86, Steel 88, 
Hendy et al. 88, Sanderson 89, Lapointe and Legendre 90, 92a, b, 95, Purvis 95b, 
Cucumel and Lapointe 97, Lapointe 97). 

In the domain addressed here, the increase of knowledge has provided tools for good 
choices among the many available methods. Besides statistical validation, the most 
interesting results in the consensus problem concern relations between the three 
approaches described in Section 2.1. Here, interactions between consensus of 
classifications and formal consensus theory have been important and fruitful. 
In particular, consensus of trees has prompted significant theoretical progress. 
Conversely, results concerning trees have been obtained (or now appear as) special 
cases of mathematical results (Margush and McMorris 81). They also lead to 
extensions to other classification models (McMorris and Powers 97). 
Abstract: Phylogenetic Information Content, a class of measures of the 
information provided by consensus trees based on the number of permitted 
resolutions of the consensus, is introduced. A formula for the number of 
permitted resolutions of Adams consensus trees is derived and a proof given. 
We argue that maximising PIC measures provides a sensible criterion for 
choosing among alternative consensus trees and we illustrate this for consensus 
trees of cladograms. 

Keywords: Consensus, Phylogeny, Adams resolutions, Information Content. 



1. Introduction 

Consensus trees (CTs) are used by systematic biologists, in many different 
contexts, to summarise graphically the agreement between multiple 
fundamental phylogenetic trees. The development of many alternative 
consensus methods raises the issue of choice between methods and/or tree(s). In 
this paper we derive a class of measures of the information provided by 
phylogenetic trees, argue that in certain contexts the optimal CTs are those 
which convey the most information, and illustrate the approach. 



2. Phylogenetic Information Content 

In order to convey information a CT must prohibit a subset of the possible 
phylogenetic relationships (Mickevich and Platnick, 1981; Wilkinson, 1994). 
Given that the information (I) conveyed by an event is 

$$I = -\log\left(\frac{\text{probability of the event}}{\sum \text{probabilities of all possible events}}\right) \qquad (1)$$

where the base of the logarithm determines the unit of information (e.g. $\log_2$: 
bits; $\ln$: nats), then, under the assumption that all possible phylogenies are 
bifurcating and equiprobable, the Phylogenetic Information Content (PIC) of a 
tree is 

$$\mathrm{PIC} = -\log\left(\frac{\text{number of permitted bifurcating trees } (n_R)}{\text{number of possible bifurcating trees } (n_T)}\right) \qquad (2)$$



Because the concept of a phylogenetic tree is a general one, PIC defines a class 
of tree information measures with specific measures existing for each type of 
phylogenetic tree. In this paper, we focus exclusively on cladograms, which are 
the n-trees of Bobisud and Bobisud (1972), and the corresponding PIC measure 
which we refer to as Cladistic Information Content (CIC). 

For a CT to provide phylogenetic information, $n_R$ must be less than $n_T$, i.e., it 
must be possible to deduce from the CT alone which of the possible trees could 
not have been represented among the fundamentals. We refer to CTs which, 
unless totally unresolved, always fulfil this condition as prohibitive. Permissive 
CTs are those which permit all possible trees (i.e. $n_R = n_T$) irrespective of their 
resolution. In order to be prohibitive, CTs must be strict sensu Wilkinson 
(1994), i.e., their groupings must represent a particular type of phylogenetic 
relationship (components, n-taxon statements or nestings) that occurs in all the 
fundamentals. 



3. Calculating CIC 

The number of possible bifurcating rooted cladograms ($n_T$) is a function of the 
number of leaves $n$ and is given by $RB(n)$: 

$$RB(n) = (2n-3)!! \qquad (3)$$

Rohlf (1982) provided equation (4) for calculating the number of permitted 
bifurcating trees ($n_R$) for strict CTs of rooted cladograms whose groupings 
represent components or other n-taxon statements, i.e., the CTs produced by the 
strict component (Sokal and Rohlf, 1981), strict Reduced Cladistic (RCC), 
Reduced Adams (RAC), Disqualifier-Faithful (DF) (Wilkinson, 1994), and 
Largest Common Pruned (LCP) (Gordon, 1980) consensus methods. 

$$n_R = \prod_{i \in V(T)} RB(d_i) \qquad (4)$$

where $d_i$ is the number of vertices immediately descendant from vertex $i$ in the 
set of vertices $V(T)$ of tree $T$. 
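
Equations (2) to (4) can be combined in a short sketch. It assumes a tree is written as nested tuples with strings as leaves, that leaves contribute a factor 1 to the product in (4), and that natural logarithms (nats) are used. The function names and the example trees are hypothetical.

```python
import math


def rb(n):
    """(2n-3)!!, the number of rooted binary trees on n leaves (1 for n <= 2)."""
    result = 1
    for odd in range(3, 2 * n - 2, 2):
        result *= odd
    return result


def leaves(tree):
    """List the leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [x for child in tree for x in leaves(child)]


def permitted_strict(tree):
    """Equation (4): product of RB(d_i) over the internal vertices."""
    if isinstance(tree, str):
        return 1  # assumption: a leaf contributes a factor 1
    total = rb(len(tree))
    for child in tree:
        total *= permitted_strict(child)
    return total


def cic(tree):
    """Equation (2) in nats, with n_T = RB(number of leaves of the tree)."""
    n_t = rb(len(leaves(tree)))
    return -math.log(permitted_strict(tree) / n_t)


star = ("a", "b", "c", "d")                # completely unresolved
partly = (("a", "b"), "c", "d")            # one component {a, b}
resolved = ((("a", "b"), "c"), "d")        # fully bifurcating

for name, tree in [("star", star), ("partly", partly), ("resolved", resolved)]:
    print(name, permitted_strict(tree), round(cic(tree), 3))
```

A completely unresolved tree permits all RB(n) trees and therefore has CIC 0, while a fully bifurcating tree on four leaves permits a single tree and has CIC ln 15.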
The number of permitted resolutions ($n_R$) of an Adams CT (Adams, 1972), the 
only strict CT to capture nestings (Adams, 1986), is 

$$n_R = \mu_{root} \qquad (5)$$

where 

$$\mu_i = \sum_{(A,B) \text{ a split of } D_i} \left( \prod_{j \in D_i} \frac{\mu_j}{RB(h_j)} \right) RB(g_A)\, RB(g_B) \qquad (6)$$

where $D_i$ is the set of vertices immediately descendant from vertex $i$, a split of a 
set $D$ is, in this case, a partition of $D$ into two non-empty, mutually exclusive 
subsets whose union is $D$, $g_A$ is the number of leaves descendant from subset $A$ 
of $V(T)$, and $h_j$ is the number of leaves in the $j$th subtree descendant from vertex 
$i$. The proof of this is given in the appendix. 
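
The recursion in equation (6) can be sketched as below. The sketch assumes trees written as nested tuples with strings as leaves, takes the value 1 for a leaf (an assumption, since the text does not state the base case), and enumerates each unordered split once by always placing the first child in A. All names are hypothetical.

```python
from itertools import combinations
from math import prod


def rb(n):
    """(2n-3)!!, the number of rooted binary trees on n leaves (1 for n <= 2)."""
    result = 1
    for odd in range(3, 2 * n - 2, 2):
        result *= odd
    return result


def adams_resolutions(tree):
    """Return (mu, h): Adams resolutions and leaf count of the subtree."""
    if isinstance(tree, str):
        return 1, 1  # assumption: a leaf has one resolution
    children = [adams_resolutions(child) for child in tree]
    mus = [mu for mu, _ in children]
    hs = [h for _, h in children]
    d = len(children)
    split_sum = 0
    # Unordered splits (A, B) of the children: child 0 is always put in A.
    for size in range(0, d - 1):
        for extra in combinations(range(1, d), size):
            g_a = hs[0] + sum(hs[j] for j in extra)
            g_b = sum(hs) - g_a
            split_sum += rb(g_a) * rb(g_b)
    # Common factor prod(mu_j / RB(h_j)); the final quotient is an integer.
    mu = prod(mus) * split_sum // prod(rb(h) for h in hs)
    return mu, sum(hs)


star = ("a", "b", "c", "d")
nested = ("a", "b", ("c", "d"))
binary = ((("a", "b"), "c"), "d")

for name, tree in [("star", star), ("nested", nested), ("binary", binary)]:
    print(name, adams_resolutions(tree)[0])
```

For the four-leaf star the sum over splits gives RB(4) = 15, i.e. every binary tree is permitted. For the tree (a, b, (c, d)) it gives 7, which agrees with a direct count of the binary trees on four leaves in which c and d are not separated at the root.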



4. Choosing a CT 

The production of a consensus profile, a set of CTs, by several consensus 
methods means that each method is not synonymous with a single CT. 
Furthermore, the information content of each CT depends upon both the 
properties of the method and the characteristics of the particular fundamentals 
under consideration. Hence our emphasis is on the choice of CT as opposed to 
method, although the question of which method(s) will produce the most 
informative CT and sets of CT(s) is also discussed. 

In Figure 1, the strict component CT is completely unresolved, and therefore 
provides no information, despite the occurrence of considerable agreement 
among the fundamentals. Insensitivity of the strict component CT has been well 
documented (e.g. Adams, 1986; Swofford, 1991). The Adams CT, in contrast, is 
well resolved and provides CIC = 5.159 nats of information. However, the 
groupings in the Adams CT represent nestings and thus the polytomies of this 
tree are more permissive than those in CTs whose groupings represent n-taxon 
statements. The strict RCC 1 and strict RCC 2 CTs provide more information 
than the Adams despite being less resolved and/or including fewer taxa. 

The LCP, RAC and strict RCC consensus methods overcome the problems of 
insensitivity and ambiguity that affect the strict component and the Adams 
respectively, by the exclusion from CTs of taxa whose variable position among 
the fundamentals prevents unambiguous summary of the relationships common 
to the remaining taxa. Multiple CTs are produced when, as is usually the case, 
there is more than one way to exclude taxa to achieve unambiguity. 

Figure 1: Two fundamental cladograms (a,b), and their strict CTs [Modified 
and extended from Adams, 1972; Wilkinson, 1994] 

Panels (CIC in nats): Fundamental 1; Fundamental 2; strict Component (0.000); 
Adams (5.159); strict RCC 1 (5.442); strict RCC 2 (5.753); strict RCC 3 (2.197); 
strict RCC 4 (1.099); LCP 1 (4.654); LCP 2 (4.654); RAC 1 (4.654); RAC 2 (2.708); 
RAC 3 (1.099). 



LCP trees are produced by pruning single taxa from the fundamental trees until 
they are rendered identical. RAC trees are produced by pruning branches from 
the polytomies of Adams CTs until the polytomies can be correctly interpreted 
as n-taxon statements. The strict RCC method first identifies the complete set of 
non-redundant n-taxon statements common to all the fundamentals. These 
n-taxon statements are then represented graphically through the production of a 
consensus profile. 

Because of the way the RCC profile is constructed, no LCP or RAC tree can 
ever provide more information than the most informative RCC tree (e.g. Figure 
1). Given that the strict component CT is included in the strict RCC profile 
unless it is completely unresolved, the choice of the most informative CT must 
be between the Adams CT and a tree from the RCC profile. In this example, strict 
RCC 2 maximises CIC. An example in which the Adams CT is the most 
informative CT is provided by Adams (1986; Figure 2). 



5. Choosing a set of CTs 

CIC can also be used to select the set of CTs which provide the most cladistic 
information. However, the occurrence of the same information in more than one 
tree means that the combined CIC of a set of CTs can be less than the sum of 
their individual CICs. If two strict CTs, both representing n-taxon statements, 
have the relationship that all taxa present in one of the trees also occur in the 
other, then the information common to the two CTs is the cladistic information 
content of the single CT in their strict RCC profile. However, if this relationship 
does not hold, then the only currently available way to determine combined CIC 
is with a brute force approach (i.e. compare every single possible tree with the 
set of CTs to determine whether it is permitted). The problem of selecting a set 
of CTs from a consensus profile so as to maximise combined information 
content appears to be computationally complex. 

Applying the brute force approach to the strict RCC profile in Figure 1, we find 
that the pair of CTs which maximise combined cladistic information is {RCC 2, 
RCC 3} (CIC = 6.954 nats). RCC 1 provides no information that is not already 
conveyed by RCC 2 and RCC 3, whilst conversely, all the information provided 
by RCC 4 is non-redundant with respect to the other three trees in the profile. 
Thus the combined cladistic information content of the entire strict RCC profile 
is 8.053 (6.954 + 1.099) nats. Because the strict RCC trees graphically represent 
all n-taxon statements common to all the fundamentals, the LCP (combined CIC 
= 5.753 nats) and RAC (combined CIC = 5.589 nats) profiles will never provide 
more information than the strict RCC profile. Furthermore, all the information 
provided by these two alternative profiles will be represented in the RCC 
profile. In fact, only the Adams CT can convey cladistic information that is non- 
redundant with respect to the strict RCC profile. 



6. Discussion 

Measures of information content provide a basis for choosing amongst 
alternative strict CTs and methods. Due to the probabilistic nature of 
information, it may also be possible to calculate the phylogenetic information 
content of majority-rule and other CTs with groupings that represent 
relationships occurring in less than 100% of the fundamentals. 
Although this paper focuses solely on cladistic information, dendrograms, 
which are equivalent to cladograms with internally ranked nodes, are another 
very important type of phylogenetic tree. Whilst consensus methods which take 
into account the rank of internal nodes have been developed (Neumann, 1983; 
Stinebrickner, 1984) they suffer from problems of ambiguity or are permissive. 
Prohibitive dendritic consensus methods are currently under development by the 
authors. 

The utility of our measure of phylogenetic tree information content extends far 
beyond the selection of CT(s). For example, a natural measure of tree similarity 
is the information common to the set of trees (Mickevich, 1978). Conversely, a 
measure of tree dissimilarity could be based on the symmetric difference of the 
information provided by each tree. Thus, PIC defines a class of tree similarity 
(and dissimilarity) measures for studies of congruence that can also be 
normalised to produce Consensus Indices (Mickevich, 1978). 



Appendix 

Proof of equation 6. 

We need the following simple lemma: 

Suppose we have a forest of $k$ rooted binary trees $T_1, \ldots, T_k$ with leaf sets $L_1, \ldots, L_k$ 
respectively, such that $\bigcup_{i=1}^{k} L_i = \{1, \ldots, n\}$, and $L_i \cap L_j = \emptyset$ for all $1 \le i < j \le k$. 
Let the number of leaves in each tree $T_i$ be $a_i$. The number of rooted binary trees $T$ 
with leaf set $\{1, \ldots, n\}$ which contain as subtrees each of the $T_i$ is given by 

$$\frac{RB(n)}{\prod_{i=1}^{k} RB(a_i)}$$

Proof: It is well known that the number of rooted binary trees with $n$ leaves is 
$(2n-3)!! = (2n-3)(2n-5)\cdots(3)(1) = (2n-3)! / (2^{n-2}(n-2)!)$. One way in which we may 
generate all the possible rooted binary trees is by successively adding leaves 
$1, \ldots, n$ to each of the edges of the tree. By adding $a_2$ leaves in a similar way to 
tree $T_1$ we can create all the possible trees with $(a_1 + a_2)$ leaves, containing $T_1$ as 
a subtree, in $(2a_1 - 1)(2a_1 + 1) \cdots (2(a_1 + a_2) - 3) = RB(a_1 + a_2) / RB(a_1)$ ways. 
Similarly, the number of trees with $(a_1 + a_2)$ leaves containing $T_2$ as a subtree is 
$RB(a_1 + a_2) / RB(a_2)$. The above argument is independent of the underlying 
trees $T_1$ and $T_2$: ergo the number of trees on $n \ge (a_1 + a_2)$ leaves containing $T_1$ 
and $T_2$ is $RB(n) / (RB(a_1) RB(a_2))$. The argument is easily extended to $T_1, \ldots, T_k$. 
Let an Adams resolution of a vertex $v$ in a tree $T$ be a resolution of $v$ which is 
consistent with treating $T$ as an Adams consensus tree. Now let $\mu_i$ be the 
number of resolutions of the subtree rooted at vertex $i$ of a given Adams 
consensus tree $T$. Then $\mu_{root}$ is the number of Adams resolutions of the 
complete tree $T$. We proceed from the tips of $T$ to the root, finding the number 
of Adams resolutions of each vertex $j$ before we continue to the ancestor of $j$. 
Let $D_j$ be the set of vertices which are immediately descendant from vertex $j$ and 
let $|D_j| = d_j$. Let $L_j$ be the set of leaf vertices in the subtree $T_j$ rooted at $j$, and $|L_j| = l_j$. 
Thus $n = l_{root}$ and the degree of vertex $j$ is $d_j + 1$. Lastly let $G(A)$ be the set of 
leaves descendant from the vertices in the set $A \subseteq V(T)$, with cardinality $g_A$. 

Consider vertex $i \in V(T)$, the immediately descendant vertices $v_j \in D_i$ of 
which have been resolved into trees $T_1, \ldots, T_{d_i}$. According to Nelson and 
Platnick's Interpretation 2b (Nelson & Platnick, 1980; Wilkinson, 1994), any 
Adams resolution of vertex $i$ must maintain the branching order of the vertices 
in $T_j$, but this resolution has no other constraints. Thus we may take any 
split $(A,B)$ of $D_i$ and resolve the vertices descendant from $A$ to one side of $i$ and 
the vertices descendant from $B = D_i \setminus A$ to the other side, while maintaining 
the branching order within each of the $T_1, \ldots, T_{d_i}$. The number of Adams 
resolutions at vertex $i$ is thus the sum over all possible splits $(A,B)$ of $D_i$, of the 
number of rooted binary trees on leaf set $\bigcup_j L_j$, containing each of the 
resolved subtrees immediately descendant from $i$. The result follows. 
Abstract: Fuzzy partitions with large (small) coefficients at the same locations 
are considered to be compatible. We investigate to what extent two or more fuzzy 
partitions are compatible and to what extent two fuzzy partitions are 
complementary. Linguistic expressions large and small are defined by 
comparison with a threshold $\alpha \in (0, 1)$. We discuss how to find $\alpha$ at which two or 
more fuzzy partitions are compatible and how to find $\alpha$ at which two fuzzy 
partitions are complementary. Then we explain how $\alpha$-compatibility and 
$\alpha$-complementarity can be used in decision making. 

Keywords: Compatibility, Complementarity, Decision Making, Fuzzy Partitions, 
$\alpha$-Cuts, Aggregation. 



1. Introduction 

Fuzzy set theory introduced by Lotfi Zadeh (1965) has been an inspiration for 
many investigators in the theory and practice of classification aimed at 
development of methods which deal with vague description of categories and with 
classes with non precise boundaries. During the last two decades fuzzy clustering 
techniques and methods of fuzzy pattern recognition have received a great deal of 
attention (Bezdek (1981), Miyamoto (1990), Klir (1995), and others). Bezdek 
(1981) introduced the following partition spaces associated with a finite set of 
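objects $X = \{x_1, x_2, \ldots, x_n\}$: 

Hard $k$-partition space ($k \ge 2$): 

$$P_k = \Big\{ U \in V_{kn} : u_{ij} \in \{0,1\};\ \sum_i u_{ij} = 1 \text{ for all } j;\ \sum_j u_{ij} > 0 \text{ for all } i \Big\}, \qquad (1)$$

where $V_{kn}$ denotes the set of real $k \times n$ matrices. Fuzzy $k$-partition space ($k \ge 2$): 

$$P_{fk} = \Big\{ U \in V_{kn} : u_{ij} \in [0,1];\ \sum_i u_{ij} = 1 \text{ for all } j;\ \sum_j u_{ij} > 0 \text{ for all } i \Big\}. \qquad (2)$$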
It is obvious that $P_k \subset P_{fk}$. Bezdek has also defined the degenerate fuzzy and 
hard $k$-partition spaces, denoted by $P_{fko}$ and $P_{ko}$ respectively, as the supersets of 
$P_{fk}$ and $P_k$ obtained by the condition $0 \le \sum_j u_{ij}$ for all $i$. The coefficient $u_{ij}$ in 
(1) and (2) is the grade of membership of object $x_j$, $j \in \{1, \ldots, n\} = N_n$, in fuzzy 
cluster $u_i$, $i \in \{1, \ldots, k\} = N_k$. 

Suppose that we have fuzzy partitions $U, V \in P_{fko}$. If $U$ and $V$ have their large 
(small) coefficients at the same locations, we say that $U$ and $V$ are compatible. If 
$U$ has its large (small) coefficients at the locations where $V$ has its small (large) 
coefficients, then $U$ and $V$ are complementary. When we want to exhibit an 
element $x_j \in X$ that typically belongs to a fuzzy cluster $u_i$ from $U$, we may 
demand its membership value to be large, i.e., greater than some threshold 
$\alpha \in [0,1]$. For each $\alpha \in [0,1]$ the $\alpha$-cut of $U \in P_{fko}$ is the matrix $U^\alpha \in V_{kn}$, where 
$u^\alpha_{ij} = 1$ if $u_{ij} \ge \alpha$ and $u^\alpha_{ij} = 0$ otherwise. Then for each $(i,j) \in N_k \times N_n$: 
$u_{ij} = \sup_\alpha \min\{\alpha, u^\alpha_{ij}\}$. 

ij J 

We say that fuzzy partitions $U, V \in P_{fko}$ are $\alpha$-compatible if $U^\alpha = V^\alpha$ (see Definition 1). 

The partitions $U, V \in P_{fko}$ are $\alpha$-complementary if $D(U^\alpha, V^\alpha) = 1$. The value of 
$D(U^\alpha, V^\alpha)$ indicates to what degree $U$ and $V$ are $\alpha$-complementary. The value 
of $1 - D(U^\alpha, V^\alpha)$ indicates to what degree $U$ and $V$ are $\alpha$-compatible. We 
address the following problems: How to find $\alpha \in [0,1]$ at which given fuzzy 
partitions are compatible? Does such an $\alpha$ always exist? How to find $\alpha \in [0,1]$ at 
which two fuzzy partitions are complementary? Does such an $\alpha$ always exist? 

The existence of two or more fuzzy partitions of the same set of objects creates a 
problem: which one to choose for further analysis or application? The choice may 
depend on several criteria. The last section of our paper describes an approach 
where the decision is based on information about $\alpha$-compatibility and 
$\alpha$-complementarity of the fuzzy partitions under consideration. 

2. Alpha - Compatible Fuzzy Partitions 

Definition 1 Let $\alpha \in [0,1]$ and $U, V \in P_{fko}$. We say that $U$ and $V$ are 
$\alpha$-compatible if and only if $u^\alpha_{ij} = v^\alpha_{ij}$ for all $(i,j) \in N_k \times N_n$. 
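
Definition 1 can be sketched in a few lines, assuming a partition is stored as a list of rows (one row per class) and that a coefficient counts as large when it is greater than or equal to the threshold. The two 2 x 2 matrices and the function names are hypothetical.

```python
def alpha_cut(partition, alpha):
    """Replace each coefficient by 1 if it is >= alpha and by 0 otherwise."""
    return [[1 if u >= alpha else 0 for u in row] for row in partition]


def are_alpha_compatible(u, v, alpha):
    """Definition 1: the alpha-cuts of the two partitions coincide."""
    return alpha_cut(u, alpha) == alpha_cut(v, alpha)


# Hypothetical fuzzy partitions of two objects into two classes
# (each column sums to 1).
U = [[0.8, 0.3],
     [0.2, 0.7]]
V = [[0.6, 0.4],
     [0.4, 0.6]]

print(alpha_cut(U, 0.5), alpha_cut(V, 0.5))
print(are_alpha_compatible(U, V, 0.5))    # same large/small pattern
print(are_alpha_compatible(U, V, 0.35))   # V has only large coefficients here
```

The same pair of partitions can be compatible at one threshold and not at another, which is why the search for a suitable threshold is treated as a problem in its own right.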
Suppose that we have a collection $\Omega = \{U_1, \ldots, U_m\}$ of fuzzy partitions $U_t \in P_{fko}$, 
$t \in N_m$. We want to find whether there is an $\alpha \in [0,1]$ at which $U_1, \ldots, U_m$ are 
compatible. 

Theorem 1 Let $\alpha \in [0,1]$. Then $U_1^\alpha = U_2^\alpha = \cdots = U_m^\alpha$ if and only if there exists a 
coefficient $u_{(r)pq}$, $r \in N_m$, $(p,q) \in N_k \times N_n$, such that for all $(i,j) \in N_k \times N_n$ 
hold: for all $t \in N_m$: $u_{(t)ij} < u_{(r)pq}$, or for all $t \in N_m$: $u_{(t)ij} \ge u_{(r)pq}$. 

A question arises: when does the coefficient $u_{(r)pq}$ described in Theorem 1 exist? 
There are two trivial cases: 

1. Let $\delta_1 = \min_{t,i,j}\{u_{(t)ij}\}$. Then for all $(i,j) \in N_k \times N_n$ hold: for 
all $t \in N_m$ we have that $u_{(t)ij} \ge \delta_1$. Therefore for $\alpha \in [0, \delta_1]$ we get that 
$U_1^\alpha = U_2^\alpha = \cdots = U_m^\alpha = [1]$. In this case all coefficients are classified as large. 

2. Let $\delta_2 = \max_{t,i,j}\{u_{(t)ij}\} < 1$. Then for all $(i,j) \in N_k \times N_n$ hold: 
for all $t \in N_m$ we have $u_{(t)ij} \le \delta_2$. Therefore for $\alpha \in (\delta_2, 1]$ we get 
$U_1^\alpha = U_2^\alpha = \cdots = U_m^\alpha = [0]$. In this case all coefficients are classified as small. 

If $\delta_2 = 1$ then $U_1^\alpha = U_2^\alpha = \cdots = U_m^\alpha$ at $\alpha = 1$ if and only if the following holds: 
if $u_{(r)ij} = 1$ for some $r \in N_m$, then for all $t \in N_m$: $u_{(t)ij} = 1$. 

We say that $U^\alpha$ represents an $\alpha$-complete partition $U$ if and only if for each $j \in N_n$ 
there exists $i \in N_k$ such that $u^\alpha_{ij} = 1$. This means that each object $x_j \in X$ belongs 
with a large coefficient of membership (compared with the threshold $\alpha$) to at 
least one class $u_i \in U$. Our goal is to find the largest nontrivial $\alpha \in (\delta_1, \delta_2]$, if it 
exists, at which the fuzzy partitions $U_1, \ldots, U_m$ are $\alpha$-compatible and 
$U_t^\alpha$, $t \in N_m$, represents an $\alpha$-complete partition. We can accomplish this goal 
using the method described below. 

Method 1 

Let $U_1, \ldots, U_m$ be partitions from $P_{fko}$ of a set $X = \{x_1, x_2, \ldots, x_n\}$. 
Let $\delta_1 = \min_{t,i,j}\{u_{(t)ij}\}$ and $\delta_2 = \max_{t,i,j}\{u_{(t)ij}\}$. 

Step 1 

Put $\alpha_0 = \min_{t,j}\{\max_i u_{(t)ij}\}$. If $\alpha_0 = \delta_1$ or $\alpha_0 = \delta_2$ then stop, because there is only 
a trivial solution. Else go to Step 2. 

Step 2 

Put $\alpha = \alpha_0$ and construct $U_t^\alpha$ for all $t \in N_m$. 

Step 3 

Find the set $S$ of elements $u_{(t)ij}$ from matrices $U_1, \ldots, U_m$ such that $u^\alpha_{(t)ij} = 0$ and there 
exists $s \in N_m$ such that $u^\alpha_{(s)ij} = 1$. If $S = \emptyset$ then stop, because 
$U_1^\alpha = U_2^\alpha = \cdots = U_m^\alpha$. Else go to Step 4. 

Step 4 

Put $\alpha_0$ = minimal element from $S$. If $\alpha_0 = \alpha$ or $\alpha_0 \le \delta_1$ then stop, because there is 
only a trivial solution. Else go to Step 2. 
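
The steps of Method 1 can be sketched as follows. The sketch rests on several assumptions: the cut uses "greater than or equal", Step 1 takes the smallest column maximum over all partitions, and Step 4 stops when the new threshold does not exceed the smallest coefficient. The function name `method_1` is hypothetical; the matrices are those of the Example in Section 4.

```python
def alpha_cut(partition, alpha):
    """Replace each coefficient by 1 if it is >= alpha and by 0 otherwise."""
    return [[1 if u >= alpha else 0 for u in row] for row in partition]


def method_1(partitions):
    """Return a nontrivial alpha with equal, complete cuts, or None."""
    values = [u for p in partitions for row in p for u in row]
    delta_1, delta_2 = min(values), max(values)
    # Step 1: smallest column maximum, so that every cut is alpha-complete.
    alpha_0 = min(max(col) for p in partitions for col in zip(*p))
    if alpha_0 == delta_1 or alpha_0 == delta_2:
        return None
    k, n = len(partitions[0]), len(partitions[0][0])
    while True:
        alpha = alpha_0                                   # Step 2
        cuts = [alpha_cut(p, alpha) for p in partitions]
        # Step 3: small coefficients sitting where another cut has a 1.
        s = [p[i][j]
             for i in range(k) for j in range(n)
             if any(c[i][j] == 1 for c in cuts)
             for p, c in zip(partitions, cuts) if c[i][j] == 0]
        if not s:
            return alpha
        alpha_0 = min(s)                                  # Step 4
        if alpha_0 == alpha or alpha_0 <= delta_1:
            return None


U = [[1.00, 0.80, 0.20, 0.20, 0.10],
     [0.00, 0.00, 0.20, 0.50, 0.50],
     [0.00, 0.20, 0.60, 0.30, 0.40]]
V = [[0.00, 0.00, 0.40, 0.75, 0.90],
     [0.30, 0.50, 0.40, 0.10, 0.10],
     [0.70, 0.50, 0.20, 0.15, 0.00]]
W = [[0.70, 0.70, 0.22, 0.15, 0.20],
     [0.15, 0.10, 0.18, 0.60, 0.40],
     [0.15, 0.20, 0.60, 0.25, 0.40]]
Z = [[0.35, 0.46, 0.10, 0.34, 0.00],
     [0.32, 0.20, 0.34, 0.66, 0.50],
     [0.33, 0.34, 0.56, 0.00, 0.50]]

print(method_1([U, W, Z]))
print(method_1([U, V, W, Z]))
```

Under these assumptions the sketch returns 0.35 for {U, W, Z} and no nontrivial threshold for the four partitions together, matching the conclusions stated in the Example.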

3. Alpha - Complementary Fuzzy Partitions 

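Definition 2 Let $\alpha \in [0,1]$ and $U, V \in P_{fko}$. We say that $U$ and $V$ are 
$\alpha$-complementary if and only if $u^\alpha_{ij} + v^\alpha_{ij} = 1$ for all $(i,j) \in N_k \times N_n$. 

Suppose that we have fuzzy partitions $U_1, U_2 \in P_{fko}$. We want to find whether 
there is an $\alpha \in [0,1]$ at which $U_1$ and $U_2$ are complementary. 

Theorem 2 Let $\alpha \in [0,1]$. Then $U_1$ and $U_2$ are $\alpha$-complementary if and only if 
there exists a coefficient $u_{(t)pq}$, $t \in N_2$, $(p,q) \in N_k \times N_n$, such that for all 
$(i,j) \in N_k \times N_n$ hold: $\min\{u_{(1)ij}, u_{(2)ij}\} < u_{(t)pq} \le \max\{u_{(1)ij}, u_{(2)ij}\}$. 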

When does the coefficient described in Theorem 2 exist and how to find it? 

The answer is given in Method 2 below. 

Method 2 

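Let $U_1, U_2 \in P_{fko}$ be fuzzy partitions of a set $X = \{x_1, x_2, \ldots, x_n\}$. 

Step 1 

Put $j = 1$. 

Step 2 

Put $\gamma_{2j} = \min_i\{\max\{u_{(1)ij}, u_{(2)ij}\}\}$. If there is a class $r \in N_k$ such that 
$\min\{u_{(1)rj}, u_{(2)rj}\} \ge \gamma_{2j}$ then stop, because there is no $\alpha \in [0,1]$ such that $U_1$ and 
$U_2$ are complementary. Else go to Step 3. 

Step 3 

Let $\gamma_{1j} = \max_i\{\min\{u_{(1)ij}, u_{(2)ij}\}\}$. Then for $\alpha \in (\gamma_{1j}, \gamma_{2j}]$ we get $u^\alpha_{(1)ij} + u^\alpha_{(2)ij} = 1$ for all $i \in N_k$. 
Increase $j$ to $j + 1$. If $j \le n$ then go to Step 2. Else go to Step 4. 

Step 4 

Calculate $\bigcap_j (\gamma_{1j}, \gamma_{2j}]$. If $\bigcap_j (\gamma_{1j}, \gamma_{2j}] = \emptyset$ then there is no $\alpha \in [0,1]$ such that $U_1$ 
and $U_2$ are complementary. Otherwise for all $\alpha \in \bigcap_j (\gamma_{1j}, \gamma_{2j}]$ we get 
$u^\alpha_{(1)ij} + u^\alpha_{(2)ij} = 1$ for all $(i,j) \in N_k \times N_n$. 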
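
Method 2 amounts to intersecting one half-open interval per object, which can be sketched as below. The sketch assumes the same "greater than or equal" cut as before and returns the pair (lower, upper), standing for the interval (lower, upper], or `None` when no threshold exists. The function name is hypothetical; the matrices are those of the Example in Section 4.

```python
def complementarity_interval(u1, u2):
    """Return (lower, upper) meaning alpha in (lower, upper], or None."""
    lower, upper = 0.0, 1.0
    # Steps 1 to 3: one interval (gamma_1j, gamma_2j] per column j.
    for col1, col2 in zip(zip(*u1), zip(*u2)):
        gamma_2 = min(max(a, b) for a, b in zip(col1, col2))
        gamma_1 = max(min(a, b) for a, b in zip(col1, col2))
        if gamma_1 >= gamma_2:
            return None  # some class blocks every threshold for this object
        lower, upper = max(lower, gamma_1), min(upper, gamma_2)
    # Step 4: the intersection of the column intervals must be non-empty.
    return (lower, upper) if lower < upper else None


U = [[1.00, 0.80, 0.20, 0.20, 0.10],
     [0.00, 0.00, 0.20, 0.50, 0.50],
     [0.00, 0.20, 0.60, 0.30, 0.40]]
V = [[0.00, 0.00, 0.40, 0.75, 0.90],
     [0.30, 0.50, 0.40, 0.10, 0.10],
     [0.70, 0.50, 0.20, 0.15, 0.00]]
W = [[0.70, 0.70, 0.22, 0.15, 0.20],
     [0.15, 0.10, 0.18, 0.60, 0.40],
     [0.15, 0.20, 0.60, 0.25, 0.40]]
Z = [[0.35, 0.46, 0.10, 0.34, 0.00],
     [0.32, 0.20, 0.34, 0.66, 0.50],
     [0.33, 0.34, 0.56, 0.00, 0.50]]

print(complementarity_interval(U, V))
print(complementarity_interval(V, W))
print(complementarity_interval(Z, V))
```

With these matrices the sketch gives the intervals (0.2, 0.3] for U and V and (0.22, 0.25] for V and W, and no interval for Z and V, in agreement with the Example.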

4 . Application 

Suppose that we have a collection of fuzzy partitions Q = { C/j , . . . , g Pjj^ , 

t eN^ of the same set of objects. Different partitions arise because of different 
classification procedures or because of different opinions of several experts. If we 
want to find a representative of Q , the following approach might be useful: 

Step 1 
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We will find all nontrivial a g [0,1] at which fuzzy partitions from Q are 
compatible. We will choose a closest to 1/k or a closest to the user’s 
recommendation. Because of the compatibility of fuzzy partitions, it make sense 
to create an aggregate fuzzy partition from Several methods of 

aggregation are available (Bodjanova (1997)). Then the partition or a 
partition eQ, which is the most similar to will be chosen as the 

representative of the collection Q . 

Step 2

If there is no acceptable $\alpha \in [0,1]$ at which the fuzzy partitions from $Q$ are
compatible, we will investigate whether $Q$ has some pairs of $\alpha$-complementary
fuzzy partitions. Let, for example, $U_r$ and $U_s$ be complementary at an acceptable
level of $\alpha$. Then we will repeat Step 1 using sets $Q_1 = Q - \{U_r\}$, $Q_2 = Q - \{U_s\}$
and $Q_3 = Q - \{U_r, U_s\}$, provided that card $Q_g \ge 2$ for $g = 1, 2, 3$.

Step 3 

If card $Q_g \le 1$, or the fuzzy partitions from $Q_g$ are not compatible at an
acceptable level of $\alpha$, we should not try to find a single representative of $Q$. The
appropriate characterization of $Q$ is simply a table of values $D(U_r^{\alpha}, U_s^{\alpha})$ at $\alpha = 1/k$
or $\alpha$ specified by the user for all $(r, s)$.

Example 

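Let $U, V, W, Z$ be fuzzy partitions of a set $X = \{x_1, \ldots, x_5\}$ into 3 classes
characterized by matrices

$$U = \begin{pmatrix} 1.00 & 0.80 & 0.20 & 0.20 & 0.10 \\ 0.00 & 0.00 & 0.20 & 0.50 & 0.50 \\ 0.00 & 0.20 & 0.60 & 0.30 & 0.40 \end{pmatrix}, \quad
V = \begin{pmatrix} 0.00 & 0.00 & 0.40 & 0.75 & 0.90 \\ 0.30 & 0.50 & 0.40 & 0.10 & 0.10 \\ 0.70 & 0.50 & 0.20 & 0.15 & 0.00 \end{pmatrix},$$

$$W = \begin{pmatrix} 0.70 & 0.70 & 0.22 & 0.15 & 0.20 \\ 0.15 & 0.10 & 0.18 & 0.60 & 0.40 \\ 0.15 & 0.20 & 0.60 & 0.25 & 0.40 \end{pmatrix}, \quad
Z = \begin{pmatrix} 0.35 & 0.46 & 0.10 & 0.34 & 0.00 \\ 0.32 & 0.20 & 0.34 & 0.66 & 0.50 \\ 0.33 & 0.34 & 0.56 & 0.00 & 0.50 \end{pmatrix}.$$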



It is easy to check that there is no nontrivial $\alpha$ such that $U$, $V$, $W$, and $Z$ are
compatible. Method 2 reveals that fuzzy partitions $U$ and $V$ are complementary at
$\alpha \in (0.2, 0.3]$ and $V$ is $\alpha$-complementary to partition $W$ at $\alpha \in (0.22, 0.25]$. Fuzzy
partitions $Z$ and $V$ are neither $\alpha$-complementary nor $\alpha$-compatible at nontrivial
$\alpha$. Therefore we will use the subset $Q_3 = \{U, W, Z\}$ and check whether it
contains $\alpha$-compatible fuzzy partitions. Method 1 leads to the conclusion that
$U^{\alpha} = W^{\alpha} = Z^{\alpha}$ at $\alpha = 0.35$. Then we can represent $Q_3$ by a single representative
$U^* \in P_{f3o}$ created by an aggregation of partitions $U$, $W$, $Z$.

For example, the simplest representative $U_1^*$ of $Q_3$ is created as the arithmetic
mean of $U$, $W$ and $Z$, i.e., $u^*_{(1)ij} = (1/3)(u_{ij} + w_{ij} + z_{ij})$. Representative $U_2^*$ is
created as the lower bound of $Q_3$ with respect to the relation $\alpha$-sharpness,
$\alpha = 0.35$. This means that for small elements we have $u^*_{(2)ij} = \min\{u_{ij}, w_{ij}, z_{ij}\}$ and
for large elements we have $u^*_{(2)ij} \ge \max\{u_{ij}, w_{ij}, z_{ij}\}$. The matrices of $U_1^*$ and $U_2^*$
are





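$$U_1^* = \begin{pmatrix} 0.683 & 0.653 & 0.173 & 0.230 & 0.100 \\ 0.157 & 0.100 & 0.240 & 0.587 & 0.467 \\ 0.160 & 0.247 & 0.587 & 0.183 & 0.433 \end{pmatrix}$$

and

$$U_2^* = \begin{pmatrix} 1.000 & 0.800 & 0.100 & 0.150 & 0.000 \\ 0.000 & 0.000 & 0.180 & 0.850 & 0.500 \\ 0.000 & 0.200 & 0.720 & 0.000 & 0.500 \end{pmatrix}.$$

It is easy to check that $U_1^{*\alpha} = U_2^{*\alpha} = U^{\alpha} = W^{\alpha} = Z^{\alpha}$ at $\alpha = 0.35$.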
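The arithmetic-mean representative and the closing check of the example can be reproduced numerically. The following sketch assumes, as above, that the $\alpha$-cut maps memberships $\ge \alpha$ to 1 and all others to 0; the helper names `alpha_cut` and `arithmetic_mean` are illustrative and not part of the original method description.

```python
from typing import List

Matrix = List[List[float]]  # rows = classes, columns = objects


def alpha_cut(u: Matrix, alpha: float) -> List[List[int]]:
    """Crisp alpha-cut: 1 where the membership is at least alpha (assumed convention)."""
    return [[1 if x >= alpha else 0 for x in row] for row in u]


def arithmetic_mean(parts: List[Matrix]) -> Matrix:
    """Element-wise mean of several partitions of the same objects."""
    k, n = len(parts[0]), len(parts[0][0])
    return [[sum(p[i][j] for p in parts) / len(parts) for j in range(n)]
            for i in range(k)]


U = [[1.00, 0.80, 0.20, 0.20, 0.10], [0.00, 0.00, 0.20, 0.50, 0.50], [0.00, 0.20, 0.60, 0.30, 0.40]]
W = [[0.70, 0.70, 0.22, 0.15, 0.20], [0.15, 0.10, 0.18, 0.60, 0.40], [0.15, 0.20, 0.60, 0.25, 0.40]]
Z = [[0.35, 0.46, 0.10, 0.34, 0.00], [0.32, 0.20, 0.34, 0.66, 0.50], [0.33, 0.34, 0.56, 0.00, 0.50]]

U1_star = arithmetic_mean([U, W, Z])
cuts = [alpha_cut(m, 0.35) for m in (U, W, Z, U1_star)]
all_equal = all(c == cuts[0] for c in cuts)  # compatibility at alpha = 0.35
```

With the matrices of the example, `all_equal` evaluates to `True`, so the aggregate has the same crisp structure as the partitions it represents.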
Abstract: A regularization method using an entropy function is studied
and contrasted with the ordinary fuzzy c-means. The way in which the two
algorithms lead to similar formulas is discussed. Classification functions derived
from the two methods, which are naturally obtained when the algorithm of
clustering is convergent, are compared. Theoretical properties of the two
classification functions are studied.

Keywords: Fuzzy c-means, Regularization, Entropy, Classification Rule

1. Introduction 

Recently, a new method of fuzzy c-means has been noted that is comparable
to the ordinary method of Dunn (1974) and Bezdek (1981). The new
method may be called an entropy method, as it uses an entropy criterion.
Several authors have studied the entropy method from different viewpoints.
Rose et al. (1990) have studied this method from the statistical mechanics
viewpoint. Li and Mukaidono (1995) have considered the same method as
a method of fuzzy c-means using the maximum entropy approach. Massulli
et al. (1997) have applied this method to the analysis of medical images.

In contrast with the foregoing formulations, the same entropy method is 
formulated within the framework of the two-stage optimization of the fuzzy 
c-means in this paper. Only the objective function to be minimized is differ- 
ent from the standard one. The objective function is regularized by adding 
an entropy-type function. Thus, relationships between the standard and 
present methods of fuzzy c-means can be disclosed. 

Fuzzy classification rules are naturally derived from the clustering
algorithms, and their properties are the second subject of the present study.
It will be shown that the fuzzy classification functions of the two methods
behave differently; their behaviors are compared at the cluster centers
and as the point moves toward infinity. This result shows a remarkable
difference between the two methods of clustering and thus is useful in
selecting an algorithm appropriate for a particular application.

2. Fuzzy c-Means and Regularization by Entropy 

The problem herein is that n objects, each of which is represented by an
h-dimensional real vector $x_k = (x_{k1}, \ldots, x_{kh})^T \in R^h$, $k = 1, \ldots, n$, should be
divided into c fuzzy clusters. Namely, the grade $u_{ik}$, $1 \le i \le c$, $1 \le k \le n$,
by which the object k belongs to the cluster i should be determined. For
each object k, the grade of membership should satisfy the condition of the
fuzzy partition:

$$M = \Big\{ (u_{ik}) : \sum_{i=1}^{c} u_{ik} = 1,\ 1 \le k \le n;\ \ 0 \le u_{ik} \le 1,\ 1 \le i \le c,\ 1 \le k \le n \Big\}.$$

The formulation by Bezdek (1981) uses the optimization of the objective
function

$$J(U, v) = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^m d(x_k, v_i) \qquad (1)$$

in which $d(x_k, v_i) = \|x_k - v_i\|^2$ is the square of the Euclidean norm between
$x_k$ and $v_i$; m is a real parameter such that m > 1; $v_i$ is called the center of
the fuzzy cluster i; and $U = (u_{ik})$ and $v = (v_1, \ldots, v_c)$.

No useful procedure for the direct optimization of J by (U, v), i.e. $\min J(U, v)$,
has been reported, and a two-stage iteration algorithm is used.

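A General Algorithm of Fuzzy c-Means (Bezdek, 1981):

(a) Initialize $U^{(0)}$. Set s = 0.

(b) Calculate cluster centers $v^{(s)}$:

$$J(U^{(s)}, v^{(s)}) = \min_{v} J(U^{(s)}, v).$$

(c) Calculate $U^{(s+1)}$:

$$J(U^{(s+1)}, v^{(s)}) = \min_{U \in M} J(U, v^{(s)}).$$

(d) If the convergence criterion $\|U^{(s+1)} - U^{(s)}\| < \varepsilon$, where $\varepsilon > 0$ is a given
parameter, is satisfied, stop; otherwise s = s + 1 and go to (b).

It is well-known (e.g., Bezdek (1981)) that the optimal solutions in (b) and
(c) are obtained using the Lagrange multiplier:

$$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m}, \qquad
u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{d(x_k, v_i)}{d(x_k, v_j)} \right)^{\frac{1}{m-1}} \right]^{-1}. \qquad (2)$$

(Remark that when $x_k = v_i$, put $u_{ik} = 1$.)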
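The two-stage iteration with the closed-form updates for the centers and memberships can be sketched as follows. This is a minimal illustration, not the authors' implementation: the convergence norm is taken to be the largest absolute change of a membership, the initial matrix `u0` is assumed to give every cluster a positive total membership, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(x: Vector, v: Vector) -> float:
    """Squared Euclidean distance d(x, v)."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def update_centers(xs: List[Vector], u: List[List[float]], m: float) -> List[List[float]]:
    """Stage (b): weighted means with weights (u_ik)^m."""
    centers = []
    for row in u:
        w = [uik ** m for uik in row]
        total = sum(w)  # assumed positive for every cluster
        centers.append([sum(wk * x[d] for wk, x in zip(w, xs)) / total
                        for d in range(len(xs[0]))])
    return centers


def update_memberships(xs: List[Vector], vs: List[Vector], m: float) -> List[List[float]]:
    """Stage (c): u_ik = 1 / sum_j (d_ik / d_jk)^(1/(m-1))."""
    c, n = len(vs), len(xs)
    u = [[0.0] * n for _ in range(c)]
    for k, x in enumerate(xs):
        d = [sq_dist(x, v) for v in vs]
        zeros = [i for i, di in enumerate(d) if di == 0.0]
        if zeros:  # x_k coincides with a center: put u_ik = 1 there
            u[zeros[0]][k] = 1.0
            continue
        for i in range(c):
            u[i][k] = 1.0 / sum((d[i] / dj) ** (1.0 / (m - 1.0)) for dj in d)
    return u


def fuzzy_c_means(xs: List[Vector], u0: List[List[float]], m: float = 2.0,
                  eps: float = 1e-6, max_iter: int = 200
                  ) -> Tuple[List[List[float]], List[List[float]]]:
    """Alternate stages (b) and (c) until the memberships stop changing."""
    u = u0
    vs: List[List[float]] = update_centers(xs, u, m)
    for _ in range(max_iter):
        vs = update_centers(xs, u, m)
        u_new = update_memberships(xs, vs, m)
        change = max(abs(a - b) for ra, rb in zip(u_new, u) for a, b in zip(ra, rb))
        u = u_new
        if change < eps:
            break
    return u, vs
```

On two well-separated groups of one-dimensional points the centers settle near the group means, and each column of the membership matrix sums to one, as required by the constraint set M.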
In contrast to the above standard method, the present method uses the
following objective function:

$$J(U, v) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}\, d(x_k, v_i) + \lambda^{-1} \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik} \log u_{ik} \qquad (3)$$

in which the second term $\sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik} \log u_{ik}$ is the regularizing functional that
fuzzifies the membership matrix.

The two-stage algorithm is applied for the new objective function. The
optimal solutions in (b) and (c) are derived by using the Lagrange multiplier:

$$\bar v_i = \frac{\sum_{k=1}^{n} \bar u_{ik} x_k}{\sum_{k=1}^{n} \bar u_{ik}}, \qquad
\bar u_{ik} = \frac{e^{-\lambda d(x_k, \bar v_i)}}{\sum_{j=1}^{c} e^{-\lambda d(x_k, \bar v_j)}} \qquad (4)$$

where the optimal solutions in the entropy method are written as $\bar v_i$ and $\bar u_{ik}$
to distinguish them from $v_i$ and $u_{ik}$ in the standard method.

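Although the two membership matrices appear to be quite different, they
are closely related. To see this, let $d_{ik} = d(x_k, v_i)$ and $\beta = \frac{1}{m-1} > 0$.
Then

$$u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{d_{ik}}{d_{jk}} \right)^{\beta} \right]^{-1}
= \frac{e^{-\beta \log d_{ik}}}{\sum_{j=1}^{c} e^{-\beta \log d_{jk}}}.$$

Thus, the membership matrices in (2) and (4) take the same form by putting $\lambda = \beta$
and using $\log d_{jk}$ in place of $d_{jk}$. In particular, the logarithmic transformation
$d_{jk} \mapsto \log d_{jk}$ should be noted.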
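The relationship between the two membership formulas can be checked numerically. The sketch below assumes strictly positive dissimilarities and uses hypothetical function names; the entropy membership is computed in a shifted form that leaves the ratio unchanged but avoids underflow.

```python
import math
from typing import List


def standard_membership(d: List[float], m: float) -> List[float]:
    """Memberships of one object in the standard method, from d_1k, ..., d_ck > 0."""
    beta = 1.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** beta for dj in d) for di in d]


def entropy_membership(d: List[float], lam: float) -> List[float]:
    """Memberships of one object in the entropy method (a softmax of -lam * d)."""
    shift = min(d)  # subtracting min(d) cancels in the ratio and avoids underflow
    w = [math.exp(-lam * (di - shift)) for di in d]
    total = sum(w)
    return [wi / total for wi in w]


d = [1.0, 4.0, 9.0]   # squared distances of one object to three centers
m = 3.0               # fuzzifier, so beta = 1 / (m - 1) = 0.5
u_std = standard_membership(d, m)
u_log = entropy_membership([math.log(di) for di in d], lam=1.0 / (m - 1.0))
```

Here `u_std` and `u_log` agree up to rounding error, which is the statement that the standard method is the entropy method applied to log-transformed dissimilarities with $\lambda = \beta$.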



3. Classification Functions 



It is strange that only a small number of studies mention fuzzy classification
functions derived from fuzzy c-means.

Keller et al. (1985) discuss the fuzzy k-nearest neighbor classification rule and
the fuzzy prototype classification rule. The latter has a most natural
classification function that has been derived from the clustering algorithm. Namely,
the function for the fuzzy classification rule $F_i$ for the class i is

$$F_i(x) = \left[ \sum_{j=1}^{c} \left( \frac{d(x, v_i)}{d(x, v_j)} \right)^{\frac{1}{m-1}} \right]^{-1} \qquad (5)$$

for the observation x, where $v_i$ is the prototype for the class i. (Remark:
when $x = v_i$, we put $F_i(x) = 1$ and $F_j(x) = 0$ for $j \ne i$.) It is obvious
that $\sum_{j=1}^{c} F_j(x) = 1$ and this function has the same form as the optimal
membership in the clustering algorithm, since $u_{ik} = F_i(x_k)$ in (2) and (5).
The above function is thus obtained from the standard fuzzy c-means
algorithm. In other words, when the standard algorithm has converged, the
rule of determining optimal membership functions provides the classification
rule (5), which interpolates the membership values of the clustered objects:

$$F_i(x_k) = u_{ik}.$$

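The same argument means that a new classification rule is derived from the
entropy method. Thus we obtain

$$\bar F_i(x) = \frac{e^{-\lambda d(x, \bar v_i)}}{\sum_{j=1}^{c} e^{-\lambda d(x, \bar v_j)}}, \qquad (6)$$

where $\bar v_j$ are the centers obtained by the entropy method. It is obvious that
$\bar F_1(x), \ldots, \bar F_c(x)$ form a fuzzy partition: $\sum_{j=1}^{c} \bar F_j(x) = 1$.

Moreover this function interpolates the membership values in (4):

$$\bar F_i(x_k) = \bar u_{ik}. \qquad (7)$$

The proof of (7) is trivial and omitted.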

Comparison between these two functions leads us to a number of observations.
They are given here without proofs to save space, since most
of the proofs are straightforward.

First, notice that the maximum value of $F_i(x)$ is at the center $v_i$ by
definition, whereas the maximum value of $\bar F_i(x)$ is not necessarily at $\bar v_i$.

Define regions in $R^h$ in which the membership for class i is greatest among
j = 1, ..., c:

$$R_i = \{ y \in R^h : F_i(y) > F_j(y),\ j \ne i \}, \qquad (8)$$

$$\bar R_i = \{ y \in R^h : \bar F_i(y) > \bar F_j(y),\ j \ne i \}. \qquad (9)$$

Moreover notice that the crisp c-means is based on the nearest prototype
allocation. Namely, when the crisp c-means algorithm has converged, the
object $x_k$ is allocated to the cluster i of the nearest center $v_i$; this means
that the region in which an object represented by a point y is allocated to
the cluster i is given by

$$K_i = \{ y \in R^h : \|y - v_i\| < \|y - v_j\|,\ j \ne i \}.$$

Notice moreover that $K_i$ is a region determined by the Voronoi diagram
(Preparata and Shamos, 1985). Then we have

Proposition 1.

$$R_i = \bar R_i = K_i.$$

We note a remarkable difference between the behaviors of the two functions
as $\|x\| \to \infty$.

Proposition 2. Assume that $R_i$ ($= \bar R_i$) is unbounded. Then we can move
x toward infinity ($\|x\| \to \infty$) in $R_i$. Then,

$$\lim_{\|x\| \to \infty} F_i(x) = \frac{1}{c}, \qquad \lim_{\|x\| \to \infty} \bar F_i(x) = 1.$$
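The contrast between the two limits can be illustrated numerically. The sketch below evaluates both classification functions for two centers in the plane; for simplicity the same centers are used for both functions, and the names `f_standard` and `f_entropy` as well as the parameter values are illustrative assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(x: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, v))


def f_standard(x: Vector, centers: List[Vector], m: float = 2.0) -> List[float]:
    """Classification function of the standard method for all classes (distinct centers assumed)."""
    d = [sq_dist(x, v) for v in centers]
    if 0.0 in d:  # x coincides with a center
        return [1.0 if di == 0.0 else 0.0 for di in d]
    beta = 1.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** beta for dj in d) for di in d]


def f_entropy(x: Vector, centers: List[Vector], lam: float = 1.0) -> List[float]:
    """Classification function of the entropy method, computed in a shifted, stable form."""
    d = [sq_dist(x, v) for v in centers]
    shift = min(d)
    w = [math.exp(-lam * (di - shift)) for di in d]
    total = sum(w)
    return [wi / total for wi in w]


centers = [[0.0, 0.0], [4.0, 0.0]]
far_point = [-1000.0, 0.0]            # far away, inside the region of center 0
std_far = f_standard(far_point, centers)[0]
ent_far = f_entropy(far_point, centers)[0]
ent_at_center = f_entropy(centers[0], centers)[0]
```

For this configuration `std_far` is close to 1/2 (that is, 1/c with c = 2), while `ent_far` is numerically 1. The value `ent_at_center` is slightly below 1, in line with the remark that the entropy-based function does not attain the value 1 at a center.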



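Another notable property in the entropy regularization is that the
classification function has convex level sets.

Proposition 3. $\bar F_i(x)$ is a quasi-convex function; in other words, an
arbitrary $\alpha$-cut for $\bar F_i(x)$,

$$[\bar F_i(\cdot)]_{\alpha} = \{ x \in R^h \mid \bar F_i(x) \ge \alpha \},$$

is a convex set.

Note the following equation (where $(\cdot, \cdot)$ is the scalar product), from which
it becomes easier to see that Propositions 2 and 3 hold.

$$\{\bar F_i(x)\}^{-1} = \sum_{j=1}^{c} e^{\lambda \{ d(x, \bar v_i) - d(x, \bar v_j) \}}
= \sum_{j=1}^{c} e^{\lambda ( \|\bar v_i\|^2 - \|\bar v_j\|^2 )}\, e^{2\lambda (x,\, \bar v_j - \bar v_i)}.$$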



The following observation is similar to the discussion in Rose et al. (1990),
but their approach is different from ours.

Let $O(x)$ be a neighborhood of x (e.g., a sphere or cube) and assume that
$P(O(x) \mid C_i)$, which is the probability that the observation is in $O(x)$ under
the condition $C_i$, is given by a Gaussian distribution, and moreover that the prior
distribution is the equal probability $P(C_i) = c^{-1}$. Then from
the Bayes rule, we have

$$P(C_i \mid O(x)) = \frac{\int_{O(x)} \exp(-\lambda \|y - \bar v_i\|^2)\, dy}{\sum_{j=1}^{c} \int_{O(x)} \exp(-\lambda \|y - \bar v_j\|^2)\, dy}
= \frac{\frac{1}{|O(x)|} \int_{O(x)} \exp(-\lambda \|y - \bar v_i\|^2)\, dy}{\sum_{j=1}^{c} \frac{1}{|O(x)|} \int_{O(x)} \exp(-\lambda \|y - \bar v_j\|^2)\, dy}.$$

Proposition 4. As $O(x) \to \{x\}$,

$$P(C_i \mid O(x)) \to \frac{\exp(-\lambda \|x - \bar v_i\|^2)}{\sum_{j=1}^{c} \exp(-\lambda \|x - \bar v_j\|^2)} = \bar F_i(x).$$

This proposition implies that the entropy method allows an interpretation
as a Bayesian rule of classification.

4. Conclusions 

A new concept of regularization in fuzzy clustering has been discussed and a 
method of regularization by entropy has been introduced. Fuzzy classifica- 
tion functions that interpolate the membership values for the objects have 
been discussed. The contrast between the two classification functions by the 
standard fuzzy c-means and the entropy method reveals characteristics of 
these clustering algorithms. Moreover the entropy method has the Bayesian 
interpretation. 

Future studies include investigations of geometric and analytical properties 
of the classification functions. Moreover, study for new methods of fuzzy 
c-means on the basis of the concept of regularization is promising, as the 
concept of regularization is general and has been found to be useful in dif- 
ferent fields of applications. 
Abstract: This paper considers the simultaneous determination of data
classification and linear regression models. The clustering criterion
introduced in this paper includes two types of mixing parameters to make a
balance between two objectives of clustering: the minimization of variances
within clusters and the minimization of regression errors. The paper
proposes an idea on dynamic determination of those parameters in the clustering
process.

Key words: Fuzzy clustering, fuzzy modeling, mixing parameters.



1. Introduction 

This paper considers the simultaneous determination of data classifica- 
tion and regression models in the context of fuzzy modeling. This problem 
was first considered in Hathaway and Bezdek (1993), in which the Fuzzy
c-Regression Models (FCRM) were proposed. One of the problems is how
to obtain a balance between the data scattering and regression errors within
each cluster.

The idea proposed in Dave (1990) can be applied to this problem, where
Dave considered an adaptive determination of weights between Fuzzy
c-Means (FCM) and Fuzzy c-Lines (FCL). Extending this approach, Ryoke et
al. (1996) developed the Adaptive Fuzzy c-Regression Models (AFCR) that
determine the mixing parameters dynamically in order to make a balance
between the data scattering and regression errors within each cluster.

The purpose of this paper is to explore a little further into the dynamic 
determination of parameters toward the discovery of interesting relationship 
and structures in a data set describing various objects and their properties. 
The paper proposes a new idea on the determination of mixing parameters 
in the context of fuzzy modeling. 

2. Fuzzy Model 

The fuzzy model proposed in Takagi and Sugeno (1985) is a nonlinear 
model that consists of a number of fuzzy rules such as 

Rule $R_i$: if z is $F_i$, then $y = g_i(x)$. (1)

Here, $x = (x_1, x_2, \ldots, x_s)^T$ is the vector of consequence variables,
$z = (z_1, z_2, \ldots, z_t)^T$ is the vector of premise variables, and $y = (y_1, y_2, \ldots, y_r)^T$ is
the vector of response variables. Often, there is an intersection between the two
variable sets $\{x_1, x_2, \ldots, x_s\}$ and $\{z_1, z_2, \ldots, z_t\}$. $F_i$ denotes a fuzzy subset
with the membership function $f_i(z)$, which has premise parameters to be
identified. The regression model $g_i(x)$ is linear in regression parameters
which should be identified as well.

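The degree of confidence of each fuzzy rule is defined by

$$f_i(z) = \prod_{j=1}^{t} \mu_{ij}(z_j;\ p_{ij1}, p_{ij2}, p_{ij3}) \qquad (2)$$

where $\mu_{ij}$ is a membership function defined in Ryoke et al. (1996). The
prediction of y is given by

$$\hat y = \frac{\sum_{i=1}^{c} f_i(z^*)\, g_i(x^*)}{\sum_{i=1}^{c} f_i(z^*)} \qquad (3)$$

where $x^*$ and $z^*$ denote actual inputs, and c is the number of rules.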
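The prediction rule is a confidence-weighted average of the rule outputs, which the following sketch makes concrete for a scalar response. It is only an illustration: the function name `ts_predict`, the callable-based interface and the two example rules are assumptions, not part of the cited models.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def ts_predict(x: Vector, z: Vector,
               confidences: List[Callable[[Vector], float]],
               models: List[Callable[[Vector], float]]) -> float:
    """Weighted average of rule outputs g_i(x) with weights f_i(z)."""
    weights = [f(z) for f in confidences]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no rule fires for this premise vector")
    return sum(w * g(x) for w, g in zip(weights, models)) / total


# Two hypothetical rules on a one-dimensional premise variable in [0, 1].
confidences = [lambda z: 1.0 - z[0], lambda z: z[0]]
models = [lambda x: 2.0 * x[0], lambda x: 1.0 + 3.0 * x[0]]
y_hat = ts_predict([1.0], [0.75], confidences, models)
```

With the premise value 0.75 the rule weights are 0.25 and 0.75 and the rule outputs are 2 and 4, so `y_hat` is 0.25 * 2 + 0.75 * 4 = 3.5.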

To develop a fuzzy model mentioned above, the simultaneous determi- 
nation of data classification and regression models is required. Recently, 
Phansalkar and Dave (1997) proposed an idea to make a balance between 
the data scattering and the distances of data from a hyperplane in each 
cluster. The objective of this paper is slightly different from theirs. This 
paper considers the balance between data scattering and regression errors. 



3. Adaptive Fuzzy c-Regression Models 

The AFCR proposed in Ryoke et al. (1996) has two mixing parameters 
which will be explained later in this section. The original proposal is an 
extension of the work in Dave (1990) in which a balance between the data 
scattering and distances of data from a line in each cluster was considered. 

This paper modifies the original idea in Ryoke et al. (1996) so that 
the balance between the data scattering and the errors of any dimensional 
regression hyperplane can be treated well. 

Let U be the fuzzy partition matrix with $u_{ik}$ for (i, k)-entry. The
membership grades $u_{ik}$ satisfy the following conditions:

$$u_{ik} \in [0,1], \quad \sum_{k=1}^{n} u_{ik} > 0, \quad \sum_{i=1}^{c} u_{ik} = 1, \quad 1 \le i \le c,\ 1 \le k \le n \qquad (4)$$

where c is the number of clusters and n is the number of data points. Let
V be the set of cluster centers, $V = \{z_1, z_2, \ldots, z_c\}$, and let $\Omega$ be the set
of regression parameters.

The clustering objective function $J(U, V, \Omega)$ is defined by

$$J(U, V, \Omega) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m L_{ik}(\Omega, V, \alpha_i, \eta_i) \qquad (5)$$

where m (> 1) is the fuzzifier and $L_{ik}(\Omega, V, \alpha_i, \eta_i)$ is defined by

$$L_{ik}(\Omega, V, \alpha_i, \eta_i) = \alpha_i E_{ik}(\Omega) + (1 - \alpha_i)\, \eta_i D_{ik}(V). \qquad (6)$$

Here, $E_{ik}(\Omega)$ and $D_{ik}(V)$ are defined as follows:

$$E_{ik}(\Omega) = \| y_k - g_i(x_k; \Omega) \|^2, \qquad (7)$$

$$D_{ik}(V) = \| z_k - z_i \|^2, \qquad (8)$$

where $z_k$ is the k-th data vector and $z_i$ is the center of cluster i.

By minimizing the objective function $J(U, V, \Omega)$, the fuzzy partition
matrix U, the cluster centers V, and the regression parameters $\Omega$ are
determined. The first term in the function $L_{ik}(\Omega, V, \alpha_i, \eta_i)$ is related to the square
error of the regression model; the second term is related
to the square distance of a data point from the cluster center. Here, $\alpha_i$ and
$\eta_i$ are parameters which are changed dynamically in the clustering process.
The parameter $\alpha_i$ is introduced to make a balance between the distances
of data of the objective variable from the regression hyperplane and the
distances of data of the conditional variables from the cluster center within
each cluster. The parameter $\eta_i$ is necessary because the spaces considered
in the first and second terms in the function are different.

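Let us consider how to determine these clustering parameters. Using the
data of the objective variable $y_p$ and all explanatory variables, we define data
vectors

$$w_{kp} = (y_{kp}, x_{k1}, \ldots, x_{ks})^T, \quad p = 1, 2, \ldots, r. \qquad (9)$$

Let $\lambda_{i1p}, \ldots, \lambda_{i(s+1)p}$ be eigenvalues of the scatter matrix

$$S_{ip} = \sum_{k=1}^{n} (u_{ik})^m (w_{kp} - w_{ip})(w_{kp} - w_{ip})^T, \qquad (10)$$

and let $\nu_{i1}, \ldots, \nu_{it}$ be eigenvalues of another scatter matrix

$$S_i = \sum_{k=1}^{n} (u_{ik})^m (z_k - z_i)(z_k - z_i)^T. \qquad (11)$$

The idea in this paper is to determine the parameters $\alpha_i$ and $\eta_i$ as follows:

$$\alpha_i = \min\{\alpha_{i1}, \ldots, \alpha_{ir}\}, \quad \alpha_{ip} = \frac{Var[\lambda_{ijp}]}{Var[\lambda_{ijp}] + Var[\nu_{ij}]}, \qquad (12)$$

$$\eta_i = \max\{\eta_{i1}, \eta_{i2}, \ldots, \eta_{ir}\}, \quad \eta_{ip} = \frac{\sum_{j=1}^{s+1} \lambda_{ijp}}{\sum_{j=1}^{t} \nu_{ij}}. \qquad (13)$$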

If the variance of eigenvalues in the space where we are building a linear
model is large, there is a chance to develop a good linear model. In this
case, we give added weight to the first term in the function. In order to
obtain better models for all objective variables, the minimum one is selected
for $\alpha_i$. The sum of eigenvalues is one of the indicators to show how data
are scattered. The reason for using the maximum one for $\eta_i$ is that we want
to develop individual linear models in a smaller region.
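Given the eigenvalues of the two kinds of scatter matrices for one cluster, the mixing parameters follow from a variance ratio and a ratio of sums. The sketch below is an illustration under assumptions: the eigenvalues are supplied directly (no eigen-decomposition is performed), the variance is taken as the population variance, and the function names are hypothetical.

```python
from statistics import pvariance
from typing import List, Tuple


def alpha_ip(lams: List[float], nus: List[float]) -> float:
    """Variance ratio for one objective variable p (population variance assumed)."""
    var_lam, var_nu = pvariance(lams), pvariance(nus)
    return var_lam / (var_lam + var_nu)  # undefined if both variances vanish


def eta_ip(lams: List[float], nus: List[float]) -> float:
    """Ratio of the eigenvalue sums, i.e. of the total scatters in the two spaces."""
    return sum(lams) / sum(nus)


def mixing_parameters(lams_by_p: List[List[float]], nus: List[float]) -> Tuple[float, float]:
    """alpha_i is the minimum and eta_i the maximum over the objective variables p."""
    alpha = min(alpha_ip(lams, nus) for lams in lams_by_p)
    eta = max(eta_ip(lams, nus) for lams in lams_by_p)
    return alpha, eta


# Hypothetical eigenvalues for one cluster with r = 2 objective variables.
lams_by_p = [[4.0, 1.0, 1.0], [5.0, 2.0, 1.0]]
nus = [3.0, 1.0]
alpha_i, eta_i = mixing_parameters(lams_by_p, nus)
```

For these hypothetical eigenvalues the first objective variable gives the smaller variance ratio, 2 / (2 + 1), and the second gives the larger ratio of sums, 8 / 4, so `alpha_i` is 2/3 and `eta_i` is 2.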

In this paper, we omit to describe the clustering algorithm which is 
similar to the ordinary fuzzy clustering algorithm. 

4. Numerical Examples 

A numerical example is shown in the following.

In Figure 1, the data are located close to a plane $2X_1 - X_2 - X_3 + 1 = 0$
in the three dimensional space. Each data point has an error value $\delta \in [-1, +1]$
produced by randomization. Therefore the data are actually on a plane
$2X_1 - X_2 - X_3 + 1 = \delta$.

For such a case, it is difficult to detect two separate clusters because
two linear equations expressing two planes are similar. Suppose $X_1$ is the
response variable and $X_2$ and $X_3$ are the explanatory variables, and they
also are the variables for which distances between data points are considered.
The total number of data points is 64. The desirable number of clusters
is 2.

We can obtain a successful clustering result, presented by two kinds of
points in Figure 2, by the above mentioned method. The two kinds of
points are determined by the elements of the fuzzy partition matrix U, and
the detailed values $u_{ik}$ of U are plotted in Figure 3. Let the white points
denote cluster 1 and let the black points denote cluster 2. The detail of the
result is shown as follows:

Figure 2: Result of Data Partition.



Result of Numerical Example

Cluster 1 (White Points)

$X_1 = -0.8408 + 0.6357 X_2 + 0.3971 X_3$
Final Parameters: $\alpha_1 = 0.7972$, $\eta_1 = 1.538$
Errors: $\sum_{k=1}^{n} (u_{1k})^m E_{1k} = 0.6624$
Distances: $\sum_{k=1}^{n} (u_{1k})^m D_{1k} = 39.46$

Cluster 2 (Black Points)

$X_1 = -0.3613 + 0.4316 X_2 + 0.4348 X_3$
Final Parameters: $\alpha_2 = 0.7067$, $\eta_2 = 1.055$
Errors: $\sum_{k=1}^{n} (u_{2k})^m E_{2k} = 0.8235$
Distances: $\sum_{k=1}^{n} (u_{2k})^m D_{2k} = 15.46$
k=i 
Figure 3: Grade of Membership Function for Cluster 1.



5. Conclusion 

In this paper, a dynamic determination procedure of mixing parameters 
in fuzzy clustering is proposed. These parameters have a role to make a 
balance between the data scattering and the prediction errors. In building 
a fuzzy model, we can use the clustering information to identify membership 
functions of premise variables. 
Abstract: This paper presents a dynamic clustering model in which clusters
are constructed in order to find the features of the dynamical change.

If the similarity between the objects is observed depending on time or
parameters which satisfy the total order relation, then it is important to
capture the change in the results of clustering according to the change in time.
In this paper, we construct a model which can represent dynamically changing
clusters by introducing the concepts of conventional dynamic MDS (Ambrosi,
K. and Hansohm, J., 1987) or dynamic PCA (Baba, Y. and Nakamura, Y.,
1997) into the additive clustering model (Sato, M. and Sato, Y., 1995).

Keywords: 3-Way Data, Dynamic MDS, Clustering Model.



1 Introduction 

INDSCAL (Carroll, J.D. and Chang, J.J., 1970) and INDCLUS (Carroll, J.D.
and Arabie, P., 1983) are typical methods for data with similarities between
the objects through time.

Dynamic MDS has been discussed for the dissimilarities which are
measured at each of T time periods. The essential meaning of the Dynamic MDS
is to capture the insightful, changing nature of the relationship among the
objects with respect to time by plotting their path over time. Suppose D is a
super-dissimilarity matrix, that is,

$$D = \begin{pmatrix}
D^{(11)} & D^{(12)} & D^{(13)} & \cdots & D^{(1T)} \\
D^{(21)} & D^{(22)} & D^{(23)} & \cdots & D^{(2T)} \\
\vdots & \vdots & \vdots & & \vdots \\
D^{(T1)} & D^{(T2)} & D^{(T3)} & \cdots & D^{(TT)}
\end{pmatrix}$$

where $D^{(tt)} = (\delta^{(tt)}_{rs})$ is the dissimilarity matrix at the t-th time. The element $\delta^{(tt')}_{rs}$
of the matrix $D^{(tt')}$ ($t \ne t'$) is the dissimilarity of object r at the t-th time with
object s at the t'-th time. The purpose of this method is to analyze the
super-dissimilarity matrix in the usual manner of metric or nonmetric
multidimensional scaling. Dynamical PCA is also applied to the super-matrix formed by
the correlation matrices.

By introducing the idea of the construction specified by a super-dissimilarity
matrix into the additive fuzzy clustering model, we propose a dynamical
additive fuzzy clustering model.



2 Dynamic Additive Fuzzy Clustering 

2.1 Additive Fuzzy Clustering Model 

An additive fuzzy clustering model is a model which introduces the concepts of 
fuzzy clusters and aggregation functions into ADCLUS. This model is defined 
in ( 1 ) and ( 2 ) according to the division between symmetric and asymmetric 
similarities. 

(1) Symmetric similarity ($s_{ij} = s_{ji}$): $s_{ij} = \sum_{k=1}^{K} \rho(u_{ik}, u_{jk}) + \varepsilon_{ij}$

(2) Asymmetric similarity ($s_{ij} \ne s_{ji}$):

I K K K 

(2-l)5jj — ^ ^ ^ “ 1 “ £ij ^ d" Sij- 

^ k=l£=l k=l 

In these models, K denotes the number of clusters and it is assumed that
$u_{ik}$ represents the fuzzy grade with which object i belongs to cluster k. Usually
$u_{ik}$ belongs to the unit interval, that is, $u_{ik} \in [0,1]$. $\rho(u_{ik}, u_{jk})$ is a symmetric
aggregation function which satisfies the following conditions:

1. $0 \le \rho(u_{ik}, u_{jl}) \le 1$, $\rho(u_{ik}, 0) = 0$, $\rho(u_{ik}, 1) = u_{ik}$

2. $\rho(u_{ik}, u_{jl}) \le \rho(u_{sk}, u_{tl})$ whenever $u_{ik} \le u_{sk}$, $u_{jl} \le u_{tl}$

3. $\rho(u_{ik}, u_{jl}) = \rho(u_{jl}, u_{ik})$

In particular, the t-norm (Menger, K., 1942) is well-known as a concrete example
which satisfies the above conditions. Originally, the t-norm was defined by
K. Menger in 1942 (see B. Schweizer and A. Sklar, 1983) as a function satisfying
a triangle-like inequality in a statistical metric space. Supposing F(x) to be a
probability distribution function, Menger proposed that the distance between
two points p and q in a set S be defined as a probability, $F_{pq}(x)$; that is, for
any real number x, $F_{pq}(x)$ is interpreted as the probability that the distance
between p and q is less than x. To introduce the metric property, he defined
the inequality $F_{pr}(x + y) \ge t(F_{pq}(x), F_{qr}(y))$ for all $p, q, r \in S$ and all real
numbers x, y. In this inequality, the function $t(\cdot,\cdot) : [0,1] \times [0,1] \to [0,1]$ is
referred to as a t-norm satisfying the above conditions 1, 2, 3 and:

4. $t(t(x, y), z) = t(x, t(y, z))$. (Associativity)

Most of the t-norms are defined by the use of a generator function as follows:
Suppose f(x) is a generator function of t-norms, which is a continuous
monotone decreasing function under these conditions:

$$f : [0,1] \to [0,\infty], \quad f(1) = 0.$$

Then t(x,y) is defined by

$$t(x,y) = f^{[-1]}(f(x) + f(y)),$$

where

$$f^{[-1]}(y) = \begin{cases} f^{-1}(y), & y \in [0, f(0)) \\ 0, & y \in [f(0), \infty] \end{cases}$$

i.e. $f^{[-1]}$ is the pseudo inverse of f, and $f^{-1}$ is the usual inverse of f.

$\gamma(u_{ik}, u_{jk})$ is an asymmetric aggregation function and is defined as follows:

$$\gamma(x, y) = f^{[-1]}(f(x) + \phi(x) f(y)).$$

Suppose f(x) is a generating function of a t-norm and $\phi(x)$ is a continuous
monotone decreasing function satisfying

$$\phi : [0,1] \to [1,\infty], \quad \phi(1) = 1.$$
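The construction of aggregation functions from a generator can be tried out directly. The sketch below builds a t-norm and an asymmetric aggregation function from a generator f, its ordinary inverse and the value f(0). The two generators used (f(x) = -log x and f(x) = 1 - x) are standard textbook choices, while the weight function phi(x) = 2 - x and all function names are assumptions made only for illustration.

```python
import math
from typing import Callable

Fn = Callable[[float], float]


def make_tnorm(f: Fn, f_inv: Fn, f0: float) -> Callable[[float, float], float]:
    """t(x, y) = pseudo-inverse of f applied to f(x) + f(y); f0 is f(0)."""
    def t(x: float, y: float) -> float:
        s = f(x) + f(y)
        return f_inv(s) if s < f0 else 0.0  # pseudo inverse
    return t


def make_asymmetric(f: Fn, f_inv: Fn, f0: float, phi: Fn) -> Callable[[float, float], float]:
    """gamma(x, y) = pseudo-inverse of f applied to f(x) + phi(x) * f(y)."""
    def gamma(x: float, y: float) -> float:
        s = f(x) + phi(x) * f(y)
        return f_inv(s) if s < f0 else 0.0
    return gamma


def neg_log(x: float) -> float:
    """Generator of the product t-norm, with f(0) = infinity."""
    return math.inf if x == 0.0 else -math.log(x)


t_product = make_tnorm(neg_log, lambda y: math.exp(-y), math.inf)
t_bounded = make_tnorm(lambda x: 1.0 - x, lambda y: 1.0 - y, 1.0)
gamma = make_asymmetric(neg_log, lambda y: math.exp(-y), math.inf,
                        lambda x: 2.0 - x)  # phi(1) = 1, decreasing, values in [1, 2]
```

With these choices `t_product(x, y)` reduces to the product xy and `t_bounded(x, y)` to max(0, x + y - 1), while `gamma(x, y)` equals x * y**(2 - x), which is not symmetric in its arguments.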



2.2 Dynamic Additive Fuzzy Clustering Model 

The data is observed by the values of similarity with respect to n objects for
T times, and the similarity matrix of the t-th time is shown by $S^{(t)} = (s^{(t)}_{ij})$. Then
a $Tn \times Tn$ matrix $\tilde S$ is denoted as follows:

$$\tilde S = \begin{pmatrix}
S^{(1)} & S^{(12)} & S^{(13)} & \cdots & S^{(1T)} \\
S^{(21)} & S^{(2)} & S^{(23)} & \cdots & S^{(2T)} \\
\vdots & \vdots & \vdots & & \vdots \\
S^{(T1)} & S^{(T2)} & S^{(T3)} & \cdots & S^{(T)}
\end{pmatrix} \qquad (2.1)$$

where the diagonal matrix $S^{(t)}$ is the $n \times n$ matrix; $S^{(rt)}$ is an $n \times n$ matrix
and the element is defined as

$$s^{(rt)}_{ij} = m(s^{(r)}_{ij}, s^{(t)}_{ij}),$$

where $s^{(t)}_{ij}$ is the (i,j)th element of the matrix $S^{(t)}$. m(x,y) is an average
function from the product space $[0,1] \times [0,1]$ to $[0,1]$ and satisfies the following
conditions:

(1) $\min\{x,y\} \le m(x,y) \le \max\{x,y\}$

(2) $m(x,y) = m(y,x)$ (symmetry)

(3) m(x,y) is increasing and continuous

Moreover, from (1), m(x,y) satisfies the following condition:

(4) $m(x,x) = x$ (idempotency)

The examples of the average function are shown in Table 2.1.

Table 2.1. Examples of the Average Functions

harmonic mean             $\dfrac{2xy}{x + y}$
geometric mean            $\sqrt{xy}$
arithmetic mean           $\dfrac{x + y}{2}$
dual of geometric mean    $1 - \sqrt{(1 - x)(1 - y)}$
dual of harmonic mean     $\dfrac{x + y - 2xy}{2 - x - y}$
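The five average functions of Table 2.1 are easy to evaluate and to check against conditions (1), (2) and (4). The sketch below is illustrative only: the function names are hypothetical, and the values chosen at the two points where a formula becomes 0/0 (x = y = 0 for the harmonic mean, x = y = 1 for its dual) are assumptions made by continuity.

```python
import math
from typing import Callable, Dict


def harmonic(x: float, y: float) -> float:
    return 0.0 if x + y == 0.0 else 2.0 * x * y / (x + y)


def geometric(x: float, y: float) -> float:
    return math.sqrt(x * y)


def arithmetic(x: float, y: float) -> float:
    return (x + y) / 2.0


def dual_geometric(x: float, y: float) -> float:
    return 1.0 - math.sqrt((1.0 - x) * (1.0 - y))


def dual_harmonic(x: float, y: float) -> float:
    return 1.0 if x + y == 2.0 else (x + y - 2.0 * x * y) / (2.0 - x - y)


AVERAGES: Dict[str, Callable[[float, float], float]] = {
    "harmonic": harmonic,
    "geometric": geometric,
    "arithmetic": arithmetic,
    "dual of geometric": dual_geometric,
    "dual of harmonic": dual_harmonic,
}


def satisfies_conditions(m: Callable[[float, float], float], steps: int = 10,
                         tol: float = 1e-12) -> bool:
    """Check conditions (1), (2) and (4) on a grid of points in [0, 1] x [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    for x in grid:
        if abs(m(x, x) - x) > tol:                        # (4) idempotency
            return False
        for y in grid:
            v = m(x, y)
            if v < min(x, y) - tol or v > max(x, y) + tol:  # (1) bounds
                return False
            if abs(v - m(y, x)) > tol:                      # (2) symmetry
                return False
    return True


results = {name: satisfies_conditions(m) for name, m in AVERAGES.items()}
```

All five entries of `results` come out `True` on this grid, which is consistent with the table; a grid check is of course no substitute for a proof.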



For the element $\tilde s_{ij}$ of $\tilde S$, we propose a new fuzzy clustering model which is
named the dynamic additive fuzzy clustering model and is as follows:

$$\tilde s_{ij} = \sum_{k=1}^{K} \rho(u_{i^{(t)}k}, u_{j^{(t)}k}), \quad 1 \le i, j \le Tn, \quad i = n(t - 1) + i^{(t)}, \qquad (2.2)$$

where K is the number of clusters and $u_{i^{(t)}k}$ shows a degree of belongingness of
an object i to a cluster k for time t, and $1 \le i^{(t)} \le n$, $1 \le t \le T$.



3 Dynamic Additive Fuzzy Clustering Model 
for Asymmetric Similarity Data 

If the observed similarity is asymmetric, then the model (2.2) cannot be used.
In order to apply the asymmetric data to the model (2.2), we have to
reconstruct the super-matrix (2.1). Suppose the upper triangular matrix of $S^{(t)}$ is
denoted as $S^{(tU)}$ and the lower triangular matrix is $S^{(tL)}$. Then we assume the
super-matrix as:



S{\L){\U) g{\L) S{IL){W) 

g{2U){\U) g{2U){\L) g{2U) 



g{\U){TU) g{W){TL) - 
g{\L){TU) g{lL){TL) 
g{2U){TU) g{2U)(TL) 



giTDiW) g(TL){\L) g{TL){2U) ... g{TL){TU) g{TL) J 

where 5-j^^), and is (ij)th. element of the matrix 

^(rc/) model (2.2) is rewritten as 



$$\tilde s_{ij} = \sum_{k=1}^{K} \rho(u_{i^{(c)d}k}, u_{j^{(c)d}k}), \qquad (2.3)$$

$1 \le i, j \le 2Tn$, $i = n(c - 1) + i^{(c)}$, $1 \le i^{(c)} \le n$, $1 \le c \le 2T$.

$\tilde s_{ij}$ is the (i, j)th element of $\tilde S$ and $u_{i^{(c)d}k}$ is a degree of belongingness of an object i
to a cluster k; c and d are suffixes representing time and the direction
of the asymmetric proximity data, respectively. The relationship is shown as
follows:

$$\begin{cases} c \equiv 0 \pmod 2: & d = L \text{ and } t = \dfrac{c}{2}, \\[4pt] c \equiv 1 \pmod 2: & d = U \text{ and } t = \dfrac{c + 1}{2}, \end{cases}$$

where $1 \le t \le T$. L and U show the lower and upper triangular matrices,
respectively, that is, the asymmetric relationship at the same time; t shows the t-th
time, and $c \equiv 0 \pmod 2$ means that c is congruent to 0 modulo 2.
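The index bookkeeping of the asymmetric super-matrix can be made explicit in code. The sketch below assumes that the blocks are ordered 1U, 1L, 2U, 2L, ..., so that odd c corresponds to the upper triangular part and even c to the lower one, and that rows are numbered from 1 with i = n(c - 1) + i_c; the function names are hypothetical.

```python
from typing import Tuple


def decode_block(c: int) -> Tuple[int, str]:
    """Map the block index c (1 <= c <= 2T) to (time t, direction d)."""
    if c < 1:
        raise ValueError("block index c must be at least 1")
    if c % 2 == 0:
        return c // 2, "L"      # c congruent to 0 (mod 2): lower triangular part
    return (c + 1) // 2, "U"    # c congruent to 1 (mod 2): upper triangular part


def decode_row(i: int, n: int) -> Tuple[int, str, int]:
    """Map a row index i of the super-matrix to (time t, direction d, object i_c)."""
    c = (i - 1) // n + 1        # block that contains row i
    i_c = i - n * (c - 1)       # object number inside the block, 1 <= i_c <= n
    t, d = decode_block(c)
    return t, d, i_c
```

For example, with n = 3 objects, row 7 of the super-matrix lies in block c = 3, so `decode_row(7, 3)` identifies it as object 1 in the upper triangular part at time 2.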

4 Numerical Example 

To demonstrate the applications of model (2.3), we will use data related to the
number of high school students who moved from one prefecture to another for
matriculation (School Basic Survey, Ministry of Education in Japan, 1992).
We define the number between two prefectures i and j as $s_{ij}$. Usually, the
number of students who moved from prefecture i to j is different from
the number from prefecture j to i, so we can write $s_{ij} \ne s_{ji}$ ($i \ne j$). To take into
account the total number of students for each prefecture, we used the
following transformation:

n 

— ^ij • (^ ^ j) (^-1) 

j=l 

Self-similarity $s_{ii}$ was avoided in this numerical examination. Using (4.1), we
can regard this as asymmetric similarity data. In the optimization
algorithm used in this example, 20 sets of initial values are given by using uniform
pseudorandom numbers in the interval [0, 1], and in the end we select the best
result. The number of clusters is determined to be 3, which is based on the
value of fitness.

We have proposed a clustering method for 3-way data (Sato M. and Sato 
Y., 1997). This method is to get the optimum clusters which minimizes ex- 
tended within-class dispersion and shows the difference on times as weights of 
clusters for every time. The criterion of the method is defined as 

T 

F(U,w) 

t=l 

where is the goodness of clustering in the ^th time which is given 

by the least square error. 



i=lj=l lc=l 

is an asymmetric similarity between objects i and j at ^-th time. The 
value of the membership functions of K fuzzy subsets, namely, the degree of 
belonging of each object to the cluster are denoted by

$$U = (u_{ik}), \quad u_{ik} \ge 0, \quad \sum_{k=1}^{K} u_{ik} = 1,$$

$i = 1, 2, \ldots, n$, $k = 1, 2, \ldots, K$.

And we denote the weight of the cluster k at the t-th time as $w^{(t)}_k$ ($t = 1, 2, \ldots, T$; $k = 1, 2, \ldots, K$).
The result of weights is shown in Table 4.1.

Table 4.1: Weights of Clusters for Times

             Time
Cluster    T_1     T_2     T_3
C_1        0.98    0.98    0.98
C_2        0.99    0.99    0.99
C_3        0.67    0.59    0.25

According to the result, we can find that cluster 3 will move through times.
Then we used the prefectures which constitute cluster 3 and again
classified them into three clusters using model (2.3), in order to find the exact
movement of these prefectures for the determined clusters. The results are
shown in Figures 4.1 to 4.6. In these figures, the abscissa shows the prefectures
and the ordinate shows the grade to each cluster. From Figure 4.3, cluster 3
consists of prefectures which are located in the central part of Japan, and
the central prefecture is Osaka. However, in time 2, Osaka and Okayama
prefectures, which have large capacity of students, moved to cluster 2. In time 3,
Okayama prefecture is still in cluster 2, but Osaka prefecture moved again to
cluster 3. From this consideration, the movement of the two large prefectures
during the period might be caused by some influences which determine the unique
clusters through the times.




Figure 4.1: Dynamic Changes of Cluster 1 on the Upper Matrix
(abscissa: Prefecture, namely Fukui, Shiga, Kyoto, Osaka, Hyogo, Nara, Wakayama, Okayama, Tokushima; ordinate: Grade)

Figure 4.2: Dynamic Changes of Cluster 2 on the Upper Matrix

Figure 4.3: Dynamic Changes of Cluster 3 on the Upper Matrix

Figure 4.4: Dynamic Changes of Cluster 1 on the Lower Matrix

Figure 4.5: Dynamic Changes of Cluster 2 on the Lower Matrix

Figure 4.6: Dynamic Changes of Cluster 3 on the Lower Matrix
5 Conclusion

The dynamic fuzzy clustering model is an efficient technique of structural 
analysis for time dependent multivariate data. This method is effective for 
describing the dynamic structural change of clusters. Technically, this model 
amounts to dynamic MDS which is obtained by the super-matrix construction. 

However, in the case of the dynamic additive fuzzy clustering model, the
coordinates of the resulting fuzzy grades have no mathematical meaning, owing
to the definition of the aggregation function family. The merit of the
super-matrix idea is that it yields the grade of belongingness to fixed
clusters through time. If all cross-time-period dissimilarities in the
super-matrix are assumed to be 0, the usual evaluation function for the
classification techniques for 3-way data is obtained (Sato, M. and Sato, Y.,
1994, 1997). However, with these methods, clusters are not invariant with
respect to the numbering of the clusters, so it is impossible to get the
latent phenomenon of the mixed observations.

Determination of the average functions may be related to the representa- 
tion of the asymmetric proximity between objects or the factors of investigation 
over time which are dependent on the properties of the observations. Choos- 
ing the adaptable function corresponding to the relationship among objects or 
times will be discussed through some numerical investigations. 
Abstract : In this paper, we present a new approach to unsupervised pattern
classification, based on fuzzy morphology. The different modes associated with
each cluster are detected by means of a fuzzy morphological transformation
applied to a membership function defined from the mode concavity properties.
The results of this clustering method are shown using artificially generated
data sets.

Key Words : Fuzzy Morphology, Fuzzy structuring function, Mode detection,
Pattern recognition, Probability density function.



1. Introduction 

Determining the structure of multidimensional data is an important problem in
exploratory data analysis. Most of the clustering procedures developed in the
sixties were based on some similarity or dissimilarity measures [1]. More
recently, some degree of fuzziness has been introduced in the metric criteria
in order to cope with the irregularities of the data distribution [2].

Another way to classify the observations is to analyse the local density function 
of the pattern distribution in the data space [3]. Note that many statistical 
procedures aim at the detection of the modes of the probability density function 
underlying the distribution of the data [4]. These modes can be considered as 
regions of the data space with high local concentration of observations, 
separated by valleys, which are regions characterised by low local concentration 
of observations. Mathematical morphological tools have proved to be well- 
suited for discriminating between these two kinds of regions [5] [6]. In this 
paper, we improve such morphological clustering techniques by introducing 
fuzzy concepts in the basic erosion and dilation operations [7]. As fuzzy 
mathematical morphology has been built on fuzzy sets, we first show how a 
multidimensional data set can be transformed as a fuzzy set, with its associated 
membership function. Then, we present a fuzzy morphological transformation, 
which directly detects the different modes associated to each cluster. Finally, the 
whole clustering method is illustrated using artificially generated data sets. 
2. Evaluation of the membership function to a mode : the
fuzzification phase

In order to detect the different modes associated to each cluster, we first 
evaluate, from the data themselves, a membership function to a mode of each 
observation, or point, of the analysed set. 

Like many other mode detection procedures working in an unsupervised context,
we first have to estimate the underlying probability density function (pdf)
related to the distribution of the observations. In the k nearest neighbours
(knn) environment, the pdf is estimated at each point of the data space by
considering the smallest domain, centred at this point, which contains a fixed
number $k_1$ of neighbours. To be more specific, the $k_1$ nearest neighbours
estimator, denoted by $\hat{p}(X_m)$, is computed at each N-dimensional
observation $X_m$ of the analysed set X of size M, where
$X = \{X_1, X_2, \ldots, X_m, \ldots, X_M\}$. At the end of the estimation
step, which takes advantage of the fast algorithm presented in [8], the
estimator $\hat{p}(X_m)$ is evaluated at each point $X_m$ by :

$\hat{p}(X_m) = \dfrac{k_1 / M}{V[D(X_m)]}$    (1)

where $V[D(X_m)]$ denotes the volume of the smallest domain containing the
$k_1$ nearest neighbours of the multidimensional observation $X_m$.

The membership function to a mode can then be evaluated by considering each
mode as a region of the data space where the pdf is concave. This approach is
based on a test which determines locally the convexity of the multivariate pdf
by assigning the “concave” or “convex” label to each point of the analysed set
[9]. In a knn environment, the local convexity of the pdf at a point $X_m$ is
determined by analysing the variation of the knn estimator when the domain
grows around $X_m$. More precisely, we fix a number $k_2$ of neighbours
($k_2 > k_1$), and we grow the domain $V_1[D(X_m)]$ until it includes $k_2$
neighbours, which yields the domain $V_2[D(X_m)]$. Then, the two estimators
$\hat{p}_1(X_m)$ and $\hat{p}_2(X_m)$ are evaluated with equation (1).

Since $V_2[D(X_m)] > V_1[D(X_m)]$ by construction, the point $X_m$ receives
the convex label if $\hat{p}_2(X_m) > \hat{p}_1(X_m)$, that is, if the
estimate increases as the domain grows. Otherwise, $X_m$ receives the concave
label.
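
The estimator (1) and the convexity test can be sketched as follows. This is a minimal illustration, not the fast algorithm of [8]: it assumes that the domain $D(X_m)$ is a hypersphere, uses a brute-force neighbour search, and assumes that no two observations coincide. The function names are hypothetical.

```python
import math

def sphere_volume(radius, dim):
    """Volume of a dim-dimensional hypersphere of the given radius."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim

def knn_density(points, m, k):
    """k-nearest-neighbour density estimate at points[m], as in equation (1)."""
    # Distances from points[m] to every other observation, smallest first.
    dists = sorted(math.dist(points[m], q) for j, q in enumerate(points) if j != m)
    radius = dists[k - 1]  # assumed spherical domain holding k neighbours
    volume = sphere_volume(radius, len(points[m]))
    return (k / len(points)) / volume

def convexity_label(points, m, k1, k2):
    """Label points[m] by comparing the estimates for k1 < k2 neighbours."""
    p1 = knn_density(points, m, k1)
    p2 = knn_density(points, m, k2)
    return "convex" if p2 > p1 else "concave"

# Two small one-dimensional samples; index 0 is the studied point.
peak = [(0.0,), (1.0,), (-1.0,), (5.0,), (-5.0,), (6.0,), (-6.0,)]
valley = [(0.0,), (3.0,), (-3.0,), (3.1,), (-3.1,), (3.2,), (-3.2,)]
print(convexity_label(peak, 0, k1=2, k2=4))    # studied point inside a dense region
print(convexity_label(valley, 0, k1=2, k2=6))  # studied point inside a sparse gap
```

In the first sample the estimate drops as the domain grows, so the studied point is labelled concave; in the second it rises, so the point is labelled convex.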

As a mode can be defined by a set of points characterised by the concave label,
we introduce here a confidence value on the event “$X_m$ is a modal
observation”, obtained from a Pi-fuzzification function (figure 1), where the
degree of belief in this event depends on the number of “concave” observations
in the neighbourhood of the studied point. Thus, the more nearest neighbours
of the studied point carry the concave label, the higher the degree of belief
in the event “$X_m$ is a modal observation”.

Figure 1 : The membership function

At the end of this fuzzification step, the fuzzy set is composed of the initial
set X with its associated mode membership function $\mu_M$.



3. Mode detection by a new fuzzy morphological transformation 

In classical morphology, the basic operations of erosion and dilation are often
combined in pairs ; the resulting operation, namely an opening, is well known
for its filtering properties. Indeed, the principal effect of erosion is to
enlarge the valleys by eliminating the irregularities of the distribution, but
it also tends to shrink the modes. On the other hand, dilation is used to
enlarge the modes, but it also tends to enhance the valleys. It would therefore
be more interesting to erode, or to dilate, to a greater or lesser extent
depending on the location of the data in the representation space, by means of
fuzzy operators. Indeed, in the definitions of the fuzzy erosion and the fuzzy
dilation presented respectively in (2) and (3), the membership function
$\mu_M$ is compared to some structuring function $\nu$, which is defined on a
domain S. Note that S is, here, a hypersphere including a number $k_f$ of
neighbours. The definition of the function $\nu$ depends on the problem to
solve.

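$E_{\nu}[\mu_M](X_m) = \inf_{X_{k_i} \in S} \max\left[\mu_M(X_{k_i}),\ 1 - \nu(X_{k_i} - X_m)\right]$    (2)

$D_{\nu}[\mu_M](X_m) = \sup_{X_{k_i} \in S} \min\left[\mu_M(X_{k_i}),\ \nu(X_{k_i} - X_m)\right]$    (3)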

As our problem is to enhance the modes while deepening the valleys, we first
used binary structuring functions depending on the mode membership function
$\mu_M$, evaluated only at the studied point $X_m$ [10]. The promising results
of such a procedure can be improved by taking into account the likeness between
the studied point $X_m$ and its nearest neighbours $X_{k_i}$. Indeed, the
confidence value on the event “$X_m$ is a modal observation” should be higher
when a great number of its neighbours themselves have high confidence values on
this event. Likewise, if the nearest neighbours have low confidence values, the
confidence value of the studied point must be lower. So, in this paper, the
binary structuring functions used for erosion and dilation depend on both
$\mu_M(X_m)$ and $\mu_M(X_{k_i})$.

To be more specific, the structuring function $\nu_e$, used in the fuzzy
erosion of equation (2), must be defined so that this operation enlarges the
valleys while not deepening the modes. This function $\nu_e(X_m - X_{k_i})$,
evaluated at each neighbour $X_{k_i}$ of the studied point $X_m$, assigns
different confidence values to the neighbours of $X_m$, according to a binary
criterion depending on the product $\mu_M(X_m) \cdot \mu_M(X_{k_i})$ (figure 2).
These confidence values must be all the higher as $X_{k_i}$ and $X_m$ are both
probably located in the valleys, that is, as the product
$\mu_M(X_m) \cdot \mu_M(X_{k_i})$ is low.

Likewise, the structuring function $\nu_d$, used in the fuzzy dilation of
equation (3), must enlarge the modes while not enhancing the valleys. So, we
use a binary criterion which assigns a high confidence value to each neighbour
$X_{k_i}$ of the studied point $X_m$ when they are both probably located in the
modes, i.e. when the product $\mu_M(X_m) \cdot \mu_M(X_{k_i})$ is high.



Figure 2 : Representation of the binary criterion for the erosion and dilation
(ordinate: magnitude of the structuring function at $X_{k_i}$,
$\nu(X_m - X_{k_i})$, with maximum value 1)

In that way, the mode detection algorithm consists in performing a fuzzy
morphological transformation, which is defined by a fuzzy erosion using the
specific structuring function $\nu_e$ followed by a fuzzy dilation using the
specific structuring function $\nu_d$. The different modes are directly
detected by iteratively applying this transformation until stabilisation of the
resulting function.
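
The erosion-then-dilation loop described above can be sketched as follows. This is only an illustration under stated assumptions: the erosion uses the usual dual form inf max[μ, 1 − ν], neighbourhoods are given as index lists that include the studied point itself, and the structuring functions are passed in as callables `nu(m, j, mu)` (a hypothetical interface). The example uses a flat structuring function equal to 1 everywhere, not the binary criterion of figure 2, so it reduces to a classical opening.

```python
def fuzzy_erosion(mu, neighbours, nu):
    """Fuzzy erosion: for each point, inf over its neighbourhood of max(mu_j, 1 - nu)."""
    return [min(max(mu[j], 1.0 - nu(m, j, mu)) for j in neighbours[m])
            for m in range(len(mu))]

def fuzzy_dilation(mu, neighbours, nu):
    """Fuzzy dilation: for each point, sup over its neighbourhood of min(mu_j, nu)."""
    return [max(min(mu[j], nu(m, j, mu)) for j in neighbours[m])
            for m in range(len(mu))]

def detect_modes(mu, neighbours, nu_e, nu_d, max_iter=100):
    """Apply erosion followed by dilation until the membership values stop changing."""
    current = list(mu)
    for _ in range(max_iter):  # the cap is a safeguard, not part of the method
        eroded = fuzzy_erosion(current, neighbours, nu_e)
        updated = fuzzy_dilation(eroded, neighbours, nu_d)
        if updated == current:
            break
        current = updated
    return current

# Six points on a line; each neighbourhood holds the point and its adjacent points.
mu = [0.2, 0.9, 0.3, 0.8, 0.7, 0.1]
neighbours = [[j for j in (m - 1, m, m + 1) if 0 <= j < len(mu)]
              for m in range(len(mu))]

def flat(m, j, mu):
    """Flat structuring function, used here only for illustration."""
    return 1.0

print(fuzzy_erosion(mu, neighbours, flat))
print(detect_modes(mu, neighbours, flat, flat))
```

With the flat function the isolated spike at the second point is filtered out, which is the filtering behaviour of an opening. The structuring functions $\nu_e$ and $\nu_d$ of the text would be supplied in place of `flat`.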



4. Example of this clustering method 

An important difficulty in cluster analysis is the presence of non-spherical
clusters. The bivariate data set of figure 3 has been generated with this
well-known difficulty in mind. Indeed, a Gaussian cluster is flanked by two
non-spherical clusters. Figure 4 presents the resulting mode membership
function evaluated after the convexity test, performed on the estimated pdf.
After the fuzzy morphological transformation presented in this paper, the 3
detected modes are well separated (figure 5).

When the observations constituting the different modal regions are identified
(figure 6), they can be considered as the prototypes of the clusters. Each
remaining observation is then assigned to the cluster attached to its nearest
neighbours among these prototypes. The result of such a classification
technique (figure 7) presents an error rate equal to 8.3 %, which can be
compared to the Isodata error rate, which is equal to 31.3 %.
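
The final assignment step can be sketched as below. The sketch assumes that a single nearest prototype (Euclidean distance) decides the cluster; the text speaks of "nearest neighbours", so a vote among several prototypes would be an equally plausible reading. The function name is hypothetical.

```python
import math

def assign_to_prototypes(prototypes, labels, observations):
    """Give each observation the cluster label of its nearest prototype."""
    assigned = []
    for x in observations:
        # Index of the prototype closest to x (Euclidean distance).
        nearest = min(range(len(prototypes)),
                      key=lambda i: math.dist(prototypes[i], x))
        assigned.append(labels[nearest])
    return assigned

# Three modal observations (prototypes) belonging to two clusters.
prototypes = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
labels = [0, 0, 1]
remaining = [(1.0, 1.0), (9.0, 8.0)]
print(assign_to_prototypes(prototypes, labels, remaining))
```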



Figure 3 : Raw data set

Figure 4 : Membership function

Figure 5 : Result of the fuzzy morphological transformation

Figure 6 : The different cores

Figure 7 : Result of the classification
6. Conclusion

We have presented a classification procedure based on the fuzzy morphological
mode detection of the pdf underlying the distribution of the data. As fuzzy
morphological operators have to be performed on fuzzy sets, we have first
defined a mode membership function, using the concavity properties of the
modes. These different modes are then directly detected by iteratively applying
to this fuzzy set a fuzzy morphological transformation which combines fuzzy
erosions and fuzzy dilations. Finally, the classification phase consists in
assigning each observation to its nearest mode.

The notion of fuzziness has been introduced in the morphological operators in
order to take into account the local structure of the data. With such fuzzy
operators, defined using specific fuzzy structuring functions adapted to our
mode detection problem, we keep the advantages of the classical morphological
operators without their drawbacks. Indeed, the fuzzy transformation presented
here directly yields well-separated modes that respect the original shape of
the data.

Given the promising results of such fuzzy techniques, the authors are now
working on the development of different fuzzy morphological procedures
including the local variations of the mode membership function.
Abstract: In the best conditions, the time complexity of classical algorithms
of hierarchical ascendant classification is, for a number n of elements to be
classified, of order $O(n^2)$. In this paper the proposed CAHCVR algorithm
is based on the principle of the "reciprocal neighbourhood" algorithm
(CAHVR). It takes into account a contiguity constraint defined by a given
connected graph. We study the properties of its results (absorbing class,
inversion, ...) and demonstrate that the average time complexity is in $O(n)$
(linear complexity). The proposed CAHCVR algorithm is applied to image
analysis in order to solve the image segmentation problem.

Keywords : hierarchical ascendant classification, connected graph of
contiguity, reciprocal neighbourhood, image segmentation.



1. The CAHCVR algorithm 

The proposed CAHCVR algorithm comprises several steps, each of which consists
of two distinct phases. The first phase finds all the couples of adjacent
elements, according to a given connected graph, such that each element of the
couple is the closest (in the sense of inertia) element to the other. These
couples are called contiguous reciprocal neighbours (RN). The second phase
(i) aggregates all the couples of contiguous RN and (ii) forms a new class for
each of them. The two phases are performed alternately until all the elements
are grouped into a single class. The contiguity condition is updated only at
the end of the two phases, in other words at the end of each step of the
CAHCVR algorithm. The new graph will be formed by, first, new vertices
corresponding to the elements which have not been incorporated. The edges of
this new graph are defined in the following way : if vertex x is adjacent to
vertex y (in the sense of the graph), all vertices which contain y are also
adjacent to all vertices adjacent to either one of the two incorporated
vertices. Such an update preserves the connectedness of the graph ;
connectedness ensures, in particular, the convergence of the algorithm, in
other words it will always be possible to gather all the elements into one
class (see Proposition 1). The CAHC algorithm [Lebart, 1978] performs only one
aggregation at each step. This aggregation corresponds to the pair of elements
which are adjacent according to the graph and minimize the distance (based on
an inertia criterion, for example). The update of the contiguity is similar to
that of the CAHCVR algorithm. The two algorithms, CAHC and CAHCVR, do not
produce equivalent results [Bachar, 1994].
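
The two phases and the contiguity update can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes Ward's inertia increase as the "distance", represents a class by its size and centroid, breaks ties by vertex number, and updates contiguity by making a new class adjacent to every class that was adjacent to one of its two members. All function names are hypothetical.

```python
def ward_distance(a, b):
    """Inertia increase when merging classes a and b, each given as (size, centroid)."""
    (na, ca), (nb, cb) = a, b
    squared = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return na * nb / (na + nb) * squared

def reciprocal_pairs(classes, adj):
    """Phase 1: couples of adjacent classes that are each other's closest neighbour."""
    nearest = {x: min(adj[x], key=lambda y: (ward_distance(classes[x], classes[y]), y))
               for x in classes if adj[x]}
    return [(x, y) for x, y in nearest.items() if x < y and nearest.get(y) == x]

def cahcvr(points, edges):
    """Return the merges as tuples (step, x, y, new_class)."""
    classes = {i: (1, tuple(p)) for i, p in enumerate(points)}
    adj = {i: set() for i in classes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    merges, next_id, step = [], len(points), 0
    while len(classes) > 1:
        pairs = reciprocal_pairs(classes, adj)
        if not pairs:  # happens only if the contiguity graph is not connected
            break
        owner = {x: x for x in classes}
        for x, y in pairs:  # Phase 2: aggregate every couple of contiguous RN
            (nx, cx), (ny, cy) = classes.pop(x), classes.pop(y)
            n = nx + ny
            centroid = tuple((nx * u + ny * v) / n for u, v in zip(cx, cy))
            classes[next_id] = (n, centroid)
            owner[x] = owner[y] = next_id
            merges.append((step, x, y, next_id))
            next_id += 1
        # Contiguity is updated once, at the end of the step.
        new_adj = {c: set() for c in classes}
        for a, nbrs in adj.items():
            for b in nbrs:
                if owner[a] != owner[b]:
                    new_adj[owner[a]].add(owner[b])
        adj = new_adj
        step += 1
    return merges

# Four points on a line with a path graph: two couples merge in the same step.
print(cahcvr([(0.0,), (1.0,), (5.0,), (6.0,)], [(0, 1), (1, 2), (2, 3)]))
# Elements 0 and 2 are close in value but not contiguous, so they cannot merge first.
print(cahcvr([(0.0,), (10.0,), (1.0,)], [(0, 1), (1, 2)]))
```

The first call shows several aggregations within one step, which is what distinguishes CAHCVR from the one-aggregation-per-step CAHC algorithm.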



2. Theoretical properties of CAHCVR algorithm 

Proposition 1 A necessary and sufficient condition for CAHC and CAHCVR to
end in a connected classification tree is that the initial graph of contiguity
is connected.

Proof : obvious from the update of contiguity. ■

Theorem Let $G_0$ be any initial connected graph whose maximal degree
(denoted $v_0$) is constant and independent of n. Suppose that the "distances"
D between an element and its contiguous neighbours (in the sense of $G_0$) are
independent non-negative random variables with the same continuous probability
law. For the CAHCVR algorithm, the average proportion of contiguous reciprocal
neighbours (RN) at step k lies in the interval ]1/4, 3/8], except in the last
step where this proportion is equal to the maximum value 1/2, assuming that
the graph can be divided into $n_k/2$ classes of contiguous couples, where
$n_k$ is the number of elements at step k. If the initial graph $G_0$ is
complete, the previous interval is ]1/4, 1/3].

Proof : We have to calculate the probability p that two contiguous elements
(x, y) form an RN couple ; we suppose that n > 2 (p = 1 if n = 2). We denote by
P the probability function (using the notation and the hypotheses of the
theorem) ; then $P(D < \rho) = F(\rho)$ is the cumulative distribution function
and $f(\rho)$ designates the probability density function ($\rho > 0$). First,
we specify the law $G(\rho)$ ($g(\rho)$ is the corresponding probability
density) of the distance between x and its closest contiguous element y among
the $v(x)$ elements contiguous to x ($x_1, x_2, \ldots, x_{v(x)}$ are the
elements contiguous to x). Next we specify the conditional probability
$H_y(\rho)$ that x is the closest element to y, given that y is the closest
element to x ($y_1, y_2, \ldots, y_{v(y)}$ are the elements contiguous to y).

$P(D(x,y) > \rho) = P\left(\bigcap_{i} \{D(x,x_i) > \rho\}\right) = (1 - F(\rho))^{v(x)}, \qquad G(\rho) = 1 - (1 - F(\rho))^{v(x)}$

$H_y(\rho) = P\left(\bigcap_{y_j \ne x} \{D(y,y_j) > \rho\}\right) = (1 - F(\rho))^{v(y)-1}$

Then the probability $p_x(y)$ that (x, y) is a couple of contiguous RN starting
from x in the graph $G_0$ is

$p_x(y) = \int_0^{\infty} H_y(\rho)\, g(\rho)\, d\rho = \dfrac{v(x)}{v(x) + v(y) - 1}$

The probability p verifies $2p = p_x(y) + p_y(x)$, and then
$p = (v(x) + v(y)) / 2(v(x) + v(y) - 1)$. As $v(x) + v(y) \ge 3$ (the graph is
connected and n > 2) we have the result $1/2 < p \le 3/4$. Then the average
number of contiguous RN couples belongs to the interval $]n_k/4, 3n_k/8]$,
where $n_k$ is the number of elements at step k ; in the last step the number
of elements is 2 and p = 1. The case of a complete initial graph $G_0$
corresponds to the CAHVR algorithm (without contiguity constraint) and then
$v(x) + v(y) \ge 4$ (for n > 2) : therefore $1/2 < p \le 2/3$. ■
The result of [Benzecri, 1982] appears to be a particular case of our theorem.
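
As a numerical illustration of the theorem, the sketch below simulates one step on a ring graph, where every element has exactly two contiguous neighbours. The ring, the uniform law for the distances and the function names are assumptions chosen for the example; the formula for p is the one derived in the proof. With v(x) = v(y) = 2 the formula gives p = 2/3, so the expected proportion of RN couples is p/2 = 1/3, inside ]1/4, 3/8].

```python
import random

def rn_probability(vx, vy):
    """p = (v(x) + v(y)) / (2 (v(x) + v(y) - 1)), as in the proof of the theorem."""
    return (vx + vy) / (2 * (vx + vy - 1))

def rn_couple_proportion(n, rng):
    """Proportion (couples / n) of reciprocal-neighbour couples on a ring of n >= 3 elements."""
    # w[i] is the random "distance" between element i and element (i + 1) % n.
    w = [rng.random() for _ in range(n)]
    nearest = []
    for x in range(n):
        left, right = w[x - 1], w[x]  # distances to (x - 1) % n and (x + 1) % n
        nearest.append((x + 1) % n if right < left else (x - 1) % n)
    # Each couple is seen once from each of its two members.
    couples = sum(1 for x in range(n) if nearest[nearest[x]] == x) // 2
    return couples / n

proportion = rn_couple_proportion(30000, random.Random(0))
print(proportion, rn_probability(2, 2) / 2)
```

The simulated proportion should be close to 1/3 for large n; the exact value printed depends on the random seed.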

Proposition 2 According to the hypotheses of the previous theorem, the average
time complexity of the CAHCVR algorithm is in $O(n)$.

Proof : We start (i) by estimating the average proportion of contiguous
elements and (ii) by computing the size of the calculations in CAHCVR.
Afterwards we estimate the average number of distances to compute and the
average number of comparisons necessary in order to identify the contiguous
RN. We denote by n the (large) number of elements to be classified and by
$\theta_i$ the proportion of contiguous RN couples at step i ($1 \le i \le K$,
where K is the (finite) number of the last step ; $0 < \theta_i \le 1/2$). We
have $n(1-\theta_1)(1-\theta_2)\cdots(1-\theta_K) = 1$. If $\theta$ denotes
the average proportion of contiguous RN couples, we have

$n(1-\theta_1)(1-\theta_2)\cdots(1-\theta_K) = 1 = n(1-\theta)^K \qquad (K = -\log(n)/\log(1-\theta) \ge \log_2(n))$

Subsequently, we use this average proportion $\theta$.

The average number of distances to compute. At the beginning of step 0, there
are n elements and we compute fewer than $n v_0 / 2$ distances ($v_0$ is the
constant maximal number of elements contiguous to any initial element). At the
end of step 0, an average number $n - n\theta = n(1-\theta)$ of elements
remain. At the start of step 1 we compute fewer than $n(1-\theta) v_0 / 2$
distances. In general, step k starts with an average number $n(1-\theta)^k$ of
elements and we compute fewer than $n(1-\theta)^k v_0 / 2$ distances. The
maximal number of distances computed is therefore

$D_{\max}(\theta) = \sum_{k=0}^{K} n(1-\theta)^k v_0 / 2, \qquad D_{\max}(\theta) \approx n v_0 / (2\theta)$,

which is decreasing in $\theta$.

The average number of comparisons necessary in order to identify the
contiguous RN couples. At the beginning of the first step we perform fewer
than $n v_0$ comparisons to identify the closest element to each element. At
the start of step number k ($k \ge 1$) we perform at most
$n(1-\theta)^{k-1} v_0$ comparisons, that is, at most
$C_1(\theta) = \sum_{k \ge 1} n(1-\theta)^{k-1} v_0$ comparisons in total. To
identify the whole set of RN couples, we need to perform fewer than
$C_2(\theta) = \sum_{k \ge 1} n(1-\theta)^{k-1}$ comparisons (it is easy to
prove that the average number of elements contiguous to any element is lower
than or equal to $v_0$ at step k). The CAHCVR algorithm requires, on average,
fewer than $C_1(\theta) + C_2(\theta) \approx \theta^{-1}(1 + v_0)\, n$
comparisons and $n v_0 / (2\theta)$ distances.

$C_1(\theta) + C_2(\theta) + D_{\max}(\theta)$ is a linear function of n,
because $\theta$ is independent of n (following the theorem). ■

Proposition 3 According to the hypotheses of the previous theorem, the
CAHCVR algorithm facilitates (in probability) the formation of contiguous
classes whose neighbourhoods are the smallest possible. All possible
incorporations of couples of contiguous elements are equally probable in the
case of the CAHC algorithm.

Proof : The expression $p = (v(x) + v(y)) / 2(v(x) + v(y) - 1)$ (following
the theorem) is decreasing with respect to $v(x) + v(y)$ : therefore, the
CAHCVR algorithm facilitates (in probability) the formation of contiguous
classes whose neighbourhoods are the smallest possible. This is not the case
in the CAHC algorithm : indeed, if $m_k$ is the number of edges at step k, the
probability q that two contiguous elements x and y form a class at step k is

$q = \int_0^{\infty} (1 - F(\rho))^{m_k - 1} f(\rho)\, d\rho = 1/m_k$ ;

then, under the same hypotheses, all possible incorporations of couples of
contiguous elements are equally probable in the case of the CAHC algorithm. ■

An absorbing class is a class that is created by successive aggregation of lower 
size classes. However, a lack of balance between the resulting aggregated 
classes is often observed (for a given partition). In practice, to avoid this lack of 
balance, an arbitrary constraint on class sizes is introduced (it is the case in the 
CAHC algorithm). In [Bachar, 1994] we demonstrate that the absorbing class 
increases (in probability) the presence of inversion. Proposition 3 highlights the 
inconvenience of absorbing classes ; it should be recalled that inversion 
phenomenon complicates the automatic choice of an "optimal" partition and 
deteriorates the final "quality" of the obtained result. 



3. The likelihood of the maximal link criterion 

This criterion of similarity between classes refers to a probability scale.
This scale can be directly associated with a dissimilarity measure between
classes in the form of a quantity of information. Following a very general
construction, such a criterion can be applied to cluster both the set of
variables and the set of described objects, whatever the mathematico-logical
nature of the data. The established inequalities permit us to understand
that with the criterion of the likelihood of the maximal link [see Lerman,
1991], it is much more difficult for inversion to take place than in the case
of explained inertia (criterion of inertia). In addition, the phenomenon of
absorbing classes decreases. These properties will be studied in a future
paper.



4. Applications and conclusion 

The aim of segmentation of a plane image is the identification of boundaries
between the objects that form the image. The notion of contiguity is natural
here : the pixels lie on a plane. The segmentation of an image is a partition
which can be obtained by classification under a contiguity constraint. This
constraint can be imposed directly in a classification algorithm, in addition
to using the attributes which characterize the measuring points. It can also
intervene in a preceding factor analysis, or in both. The choice of the
variables which characterize a point of the image defines the segmentation
criteria. The proposed treatment is applied to several real medical radiology
images. We use the original image (n = 512 x 512 pixels) showing an abdomen in
fig. 1. The image segmentation in fig. 2 (partition into 70 classes using the
CAHCVR algorithm with 3 statistical attributes characterizing the measuring
pixels and with $v_0 = 8$) is obtained in 129 seconds (HP9000 computer, HP-UX
V.10). The time evaluation is displayed in fig. 3 (for different values of the
number n of pixels : 50x50, 100x100, 150x150, ..., 500x500). The linearity of
the time complexity and the quality of the results are verified.

Acknowledgments : We thank Johanne Bezy-Wendling and Fabrice Wendling
(LTSI) for the acquisition and evaluation of the images.
Figure 1 : Original image of the abdomen, acquired with a helical CT
(Siemens Somatom Plus 4, South Hospital, Rennes, France)

Figure 2 : Result of the CAHCVR algorithm applied to the image of Figure 1.
Several organs and bone structures are quite well identified (vertebra, liver,
kidneys, fatty tissue...)

Figure 3 : Time evaluation (seconds) for different sizes of image



Constrained Clustering Problems 

Vladimir Batagelj¹, Anuska Ferligoj²

¹ University of Ljubljana, Faculty of Mathematics and Physics, and
Institute of Mathematics, Physics and Mechanics, Dept. of TCS,
Jadranska 19, 1000 Ljubljana, Slovenia
² University of Ljubljana, Faculty of Social Sciences,
P.O. Box 47, 1109 Ljubljana, Slovenia
1. Introduction

For constrained clustering, grouping similar units into clusters has to satisfy some 
additional conditions. This class of problems is relatively old. One of the most 
frequently treated problems in this field is regionalization: clusters of similar ge- 
ographical regions have to be found, according to some chosen characteristics, 
where the regions included in the cluster have to be also geographically con- 
nected. A number of approaches to this problem have been taken. The majority 
of authors (e.g., Lebart, 1978; Lefkovitch, 1980; Ferligoj and Batagelj, 1982; 
Perruchet, 1983; Gordon, 1973, 1980, 1987; Legendre, 1987) solve this problem 
by adapting standard clustering procedures, especially agglomerative hierarchical 
algorithms, and local optimization clustering procedures. The geographic conti- 
guity is a special case of relational constraint. Ferligoj and Batagelj (1982, 1983) 
first treated this clustering problem for general symmetric relations and then for 
nonsymmetric relations. It is possible to work also with other, non-relational 
conditions. Murtagh (1985) provides a review of clustering with symmetric rela- 
tional constraints. A more recent survey of constrained clustering was given by 
Gordon (1996). 



2. Formalization 

The (constrained) clustering problem can be posed as an optimization problem 
(Ferligoj and Batagelj, 1982, 1983): 

Let E be a finite set of units. Its nonempty subset $C \subseteq E$ is called
a cluster. A set of clusters $\mathbf{C} = \{C_i\}$ forms a clustering.

The clustering problem $(\Phi, P, \min)$ can be expressed as: Determine the
clustering $\mathbf{C}^* \in \Phi$ for which

$P(\mathbf{C}^*) = \min_{\mathbf{C} \in \Phi} P(\mathbf{C})$

where $\Phi$ is a set of feasible clusterings and $P : \Phi \to \mathbb{R}_0^+$
is a clustering criterion function. We denote the set of minimal solutions by
$\mathrm{Min}(\Phi, P)$.

Let us introduce some notions which we shall need in the following. 

The clustering $\mathbf{C}$ is a complete clustering if it is a partition of
the set of units E. We shall denote by $\Pi(E)$ the set of all complete
clusterings of E. Two among them, $\mathbf{O} \equiv \{\{X\} : X \in E\}$ and
$\mathbf{I} \equiv \{E\}$, deserve to be denoted by special symbols. The set of
feasible clusterings $\Phi$ can be decomposed into "strata" (layers)

$\Phi_k = \{\mathbf{C} \in \Phi : \mathrm{card}(\mathbf{C}) = k\}$

Let $(\mathbb{R}_0^+, \oplus, 0, \le)$ be an ordered abelian monoid. A simple
criterion function P has the form:

$P(\mathbf{C}) = \bigoplus_{C \in \mathbf{C}} p(C), \quad p(C) \ge 0 \quad \text{and} \quad \forall X \in E : p(\{X\}) = 0$

For almost all criterion functions used in applications, it also holds that
$p(C_1 \cup C_2) \ge p(C_1) \oplus p(C_2)$. For a simple criterion function
satisfying this condition, it holds for $k < \mathrm{card}\, E$ that
$\forall \mathbf{C} \in \Pi_k\ \exists \mathbf{C}' \in \Pi_{k+1} : P(\mathbf{C}') \le P(\mathbf{C})$.
Since $P(\mathbf{C}) \ge 0$ and $P(\mathbf{O}) = 0$, it holds that
$\mathbf{O} \in \mathrm{Min}(\Pi, P)$. To avoid this trivial solution we
usually introduce the obvious constraint: we restrict the problem to $\Pi_k$,
where k is a given number of clusters.
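
For a very small set of units, the problem restricted to $\Pi_k$ can be solved by exhaustive enumeration, which makes the formalization concrete. The sketch below is only an illustration: it assumes ordinary addition for $\oplus$ and takes for p(C) the sum of squared deviations from the cluster mean (so that p({X}) = 0); the function names and the data are hypothetical.

```python
def partitions(units, k):
    """Yield every partition of the list `units` into exactly k nonempty clusters."""
    if not units:
        if k == 0:
            yield []
        return
    first, rest = units[0], units[1:]
    # Either `first` joins one cluster of a k-partition of the remaining units ...
    for part in partitions(rest, k):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
    # ... or it forms a cluster of its own next to a (k - 1)-partition.
    for part in partitions(rest, k - 1):
        yield [[first]] + part

def within_cluster_error(cluster, value):
    """p(C): sum of squared deviations from the cluster mean; zero for a singleton."""
    vals = [value[u] for u in cluster]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals)

def solve(units, k, value):
    """Return a clustering in Pi_k minimizing P(C) = sum of p(C) over its clusters."""
    return min(partitions(units, k),
               key=lambda clustering: sum(within_cluster_error(c, value)
                                          for c in clustering))

value = {"a": 1.0, "b": 1.2, "c": 5.0, "d": 5.1}
best = solve(list(value), 2, value)
print(best)
```

The number of partitions grows very quickly with the number of units, which is why the heuristic methods of Section 4 are needed in practice.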

Not all clustering problems can be expressed by a simple criterion function.
In some applications a general criterion function of the form

$P(\mathbf{C}) = \bigoplus_{(C_1, C_2) \in \mathbf{C} \times \mathbf{C}} q(C_1, C_2), \quad q(C_1, C_2) \ge 0$

is needed. An example of general criterion functions can be found in
blockmodeling (Batagelj, Ferligoj, Doreian, 1992; Batagelj, 1997)

$P(\mathbf{C}; \mathcal{T}) = \sum_{(C_1, C_2) \in \mathbf{C} \times \mathbf{C}} \min_{T \in \mathcal{T}} w(T)\, \delta(C_1, C_2; T)$

where $\mathcal{T}$ is a set of feasible types, and $\delta$ measures the
deviation of blocks, induced by a clustering, from the ideal block structure.
Blockmodeling methods that also consider (dis)similarities between units have
still to be developed. The proposed optimization approach essentially
expresses the constraints with a penalty function.

Another such example is the problem of partitioning a generation of pupils
into a given number of classes so that the classes consist of (almost) the
same number of pupils and have a structure as similar as possible. An
appropriate criterion function is

$P(\mathbf{C}) = \max_{\substack{\{C_1, C_2\} \in \mathbf{C} \times \mathbf{C} \\ \mathrm{card}\, C_1 \ge \mathrm{card}\, C_2}} \ \min_{\substack{f : C_1 \to C_2 \\ f \text{ is surjective}}} \ \max_{X \in C_1} d(X, f(X))$

where d(X, Y) is a measure of dissimilarity between pupils X and Y.
3. Types of Constrained Clusterings

Various types of the constraints are discussed below. 

3.1. Relational Constraints 

Generally, the set of feasible clusterings for this type of constraint can be
defined as:

$\Phi(R) = \{\mathbf{C} \in \Pi$ : each cluster $C \in \mathbf{C}$ is a subgraph $(C, R \cap C \times C)$ in the
graph $(E, R)$ with the required type of connectedness$\}$

We can define different types of sets of feasible clusterings for the same
relation R if it is nonsymmetric (Ferligoj and Batagelj, 1983). Some examples
of clusterings with (nonsymmetric) relational constraint are



type of clusterings    type of connectedness
$\Phi^1(R)$            weakly connected units
$\Phi^2(R)$            weakly connected units that contain at most one center
$\Phi^3(R)$            strongly connected units
$\Phi^4(R)$            clique
$\Phi^5(R)$            the existence of a trail containing all the units of the cluster



A center of a cluster C in the clustering type $\Phi^2(R)$ is the set of units
$L \subseteq C$ iff the subgraph induced by L is strongly connected and
$R(L) \cap (C \setminus L) = \emptyset$, where
$R(L) = \{y : \exists x \in L : xRy\}$.
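
Checking whether a clustering satisfies a relational constraint amounts to a connectivity test on each induced subgraph. The sketch below covers only the weakest type of connectedness mentioned above (weakly connected units, i.e. arc directions are ignored); the other types would need different tests. The relation is assumed to be given as a set of ordered pairs, and the function names are hypothetical.

```python
def is_weakly_connected(cluster, relation):
    """True if the subgraph induced by `cluster` is connected when directions are ignored."""
    cluster = set(cluster)
    if not cluster:
        return False  # clusters are nonempty by definition
    # Keep only the arcs of R with both endpoints in the cluster.
    nbrs = {u: set() for u in cluster}
    for x, y in relation:
        if x in cluster and y in cluster:
            nbrs[x].add(y)
            nbrs[y].add(x)
    # Depth-first search from an arbitrary unit of the cluster.
    start = next(iter(cluster))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in nbrs[u] - seen:
            seen.add(v)
            stack.append(v)
    return seen == cluster

def is_feasible(clustering, relation):
    """True if every cluster of the clustering is weakly connected in (E, R)."""
    return all(is_weakly_connected(c, relation) for c in clustering)

R = {(1, 2), (2, 3), (3, 4)}
print(is_feasible([[1, 2], [3, 4]], R))  # both clusters are connected
print(is_feasible([[1, 3], [2, 4]], R))  # units 1 and 3 are not related
```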

3.2. Constraining Variables 

The set of feasible clusterings for this particular type of constraint is
defined as follows (Ferligoj, 1986):

$\Phi[a, b] = \{\mathbf{C} \in \Pi$ : for each cluster $C \in \mathbf{C}$ holds: $v_C \in [a, b]\}$

where $v_C$ is a function of values of the constraining variable, V, for the
units in the cluster C.

3.3. Optimizational Constraint

The set of feasible clusterings for an optimizational constraint is defined as:

$\Phi(F) = \{\mathbf{C} \in \Pi$ : for second criterion F the condition $F(\mathbf{C}) < f$ holds$\}$

where f is a given threshold value for the second criterion.

A combination of all three types of constraints (relational, constraining vari- 
able and optimizational) can be considered simultaneously. 
3.4. Pre-specified Blockmodels

Given a network $\mathbf{N}$, a set of types $\mathcal{T}$, and a model M
(constraints!), determine $\mu$ (clustering) which minimizes the criterion
function (for details see Batagelj, Ferligoj, Doreian, 1998).



4. Solving Constrained Clustering Problems 

With few exceptions the clustering problem is too hard to be exactly solved ef- 
ficiently. Therefore, approximative/heuristic methods have to be used. Among 
these, agglomerative (hierarchical) and local optimization (relocation) methods 
are the most popular. 

4.1. Hierarchical Algorithm 

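The set of feasible clusterings $\Phi$ determines the feasibility predicate $\Phi(\mathbf{C}) \equiv \mathbf{C} \in \Phi$ defined on $\mathcal{P}(\mathcal{P}(E) \setminus \{\emptyset\})$; and conversely $\Phi \equiv \{\mathbf{C} \in \mathcal{P}(\mathcal{P}(E) \setminus \{\emptyset\}) : \Phi(\mathbf{C})\}$.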

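In the set of all clusterings $\Phi$ the relation of clustering inclusion $\sqsubseteq$ can be introduced by

$$\mathbf{C}_1 \sqsubseteq \mathbf{C}_2 \equiv \forall C_1 \in \mathbf{C}_1, C_2 \in \mathbf{C}_2 : C_1 \cap C_2 \in \{\emptyset, C_1\}$$

we say also that the clustering $\mathbf{C}_1$ is a refinement of the clustering $\mathbf{C}_2$.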

It is well known that $(\Pi, \sqsubseteq)$ is a partially ordered set (even more, a semimodular lattice). Because any subset of a partially ordered set is also partially ordered, we have: Let $\Phi \subseteq \Pi$; then $(\Phi, \sqsubseteq)$ is a partially ordered set.

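The clustering inclusion determines two related relations (on $\Phi$):

$$\mathbf{C}_1 \sqsubset \mathbf{C}_2 \equiv \mathbf{C}_1 \sqsubseteq \mathbf{C}_2 \land \mathbf{C}_1 \ne \mathbf{C}_2$$ and

$$\mathbf{C}_1 \mathrel{\sqsubset\!\cdot} \mathbf{C}_2 \equiv \mathbf{C}_1 \sqsubset \mathbf{C}_2 \land \lnot \exists \mathbf{C} \in \Phi : (\mathbf{C}_1 \sqsubset \mathbf{C} \land \mathbf{C} \sqsubset \mathbf{C}_2).$$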

4.1.1. Conditions on the Structure of the Set of Feasible Clusterings. In the following we shall assume that the set of feasible clusterings $\Phi \subseteq \Pi$ satisfies the following conditions (Batagelj, 1984):

F1. The feasibility predicate $\Phi$ is local - it has the form $\Phi(\mathbf{C}) = \bigwedge_{C \in \mathbf{C}} \varphi(C)$ where $\varphi(C)$ is a predicate defined on $\mathcal{P}(E) \setminus \{\emptyset\}$ (clusters).

The intuitive meaning of $\varphi(C)$ is: $\varphi(C) \equiv$ the cluster $C$ is "good". Therefore the locality condition can be read: a "good" clustering $\mathbf{C} \in \Phi$ consists of "good" clusters.

F2. $\mathbf{O} \in \Phi$

F3. The predicate $\varphi$ has the property of binary heredity with respect to the fusibility predicate $\psi(C_1, C_2)$, i.e.,

$$C_1, C_2 \ne \emptyset \land C_1 \cap C_2 = \emptyset \land \varphi(C_1) \land \varphi(C_2) \land \psi(C_1, C_2) \Rightarrow \varphi(C_1 \cup C_2)$$

This condition means: in a "good" clustering, a fusion of two "related" clusters produces a "good" clustering.

F4. The predicate $\psi$ is compatible with clustering inclusion $\sqsubseteq$, i.e.,

$$\forall \mathbf{C}_1, \mathbf{C}_2 \in \Phi : (\mathbf{C}_1 \sqsubset \mathbf{C}_2 \land \mathbf{C}_1 \setminus \mathbf{C}_2 = \{C_1, C_2\} \Rightarrow \psi(C_1, C_2) \lor \psi(C_2, C_1))$$

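F5. The "interpolation" property holds in $\Phi$, i.e., $\forall \mathbf{C}_1, \mathbf{C}_2 \in \Phi$:

$$(\mathbf{C}_1 \sqsubset \mathbf{C}_2 \land \mathrm{card}(\mathbf{C}_1) > \mathrm{card}(\mathbf{C}_2) + 1 \Rightarrow \exists \mathbf{C} \in \Phi : (\mathbf{C}_1 \sqsubset \mathbf{C} \land \mathbf{C} \sqsubset \mathbf{C}_2))$$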

It is easy to verify that the sets of feasible clusterings $\Phi^i(R)$, $i = 1, 2, 4, 5$ from Ferligoj and Batagelj (1983) satisfy the conditions F1 - F5. But, in the case of $\Phi^3(R)$ the property F5 fails (in general). The counterexample is given in Figure 5 in Ferligoj and Batagelj (1983), for which we have $\{\{1, 2\}, \{3, 4\}, \{5, 6\}\} \mathrel{\sqsubset\!\cdot} \{\{1, 2, 3, 4, 5, 6\}\}$.

4.1.2. Criterion Function and Agglomerative Clustering. A dissimilarity between clusters is a function $d : (C_1, C_2) \to \mathbb{R}_0^+$ which is symmetric, i.e., $d(C_1, C_2) = d(C_2, C_1)$. Let $(\mathbb{R}_0^+, \oplus, 0, \le)$ be an ordered abelian monoid. Then the criterion function $P$ is compatible with dissimilarity $d$ over $\Phi$ iff:

$$\forall C \subseteq E : \left(\varphi(C) \land \mathrm{card}(C) > 1 \Rightarrow p(C) = \min_{(C_1, C_2) \in \Psi(C)} \left(p(C_1) \oplus p(C_2) \oplus d(C_1, C_2)\right)\right)$$

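Now we can state our main theorem:

Theorem 1 Let $P$ be a simple criterion function compatible with $d$ over $\Phi$, $\oplus$ distributes over min, and F1 - F5 hold, then

$$P(\mathbf{C}^*_k) = \min_{\mathbf{C} \in \Phi_k} P(\mathbf{C}) = \min_{\substack{C_1, C_2 \in \mathbf{C} \in \Phi_{k+1} \\ \psi(C_1, C_2)}} \left(P(\mathbf{C}) \oplus d(C_1, C_2)\right)$$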

From this theorem the following "greedy" approximation can be seen:

$$\min_{\substack{C_1, C_2 \in \mathbf{C} \\ \psi(C_1, C_2)}} d(C_1, C_2)$$

which is the basis for the following agglomerative (binary) procedure for solving the clustering problem:

1. $k := n\ (= \mathrm{card}\, E)$; $\mathbf{C}(k) := \mathbf{O}$;

2. while $\exists C_i, C_j \in \mathbf{C}(k) : (i \ne j \land \psi(C_i, C_j))$ repeat

2.1. $(C_p, C_q) := \operatorname{argmin}\{d(C_i, C_j) : i \ne j \land \psi(C_i, C_j)\}$;

2.2. $C := C_p \cup C_q$; $k := k - 1$;

2.3. $\mathbf{C}(k) := \mathbf{C}(k + 1) \setminus \{C_p, C_q\} \cup \{C\}$;

2.4. determine $d(C, C_s)$ for all $C_s \in \mathbf{C}(k)$

3. $m := k$

Note that this procedure is not an exact procedure for solving the clustering problem. But, because of the nature of the clustering problem (Garey and Johnson, 1979; Shamos, 1976; Brucker, 1978), it seems that we are forced to search for and to use such procedures. This procedure also has some nice properties:

Theorem 2 All clusterings $\mathbf{C}(k)$, $k = n, n - 1, \dots, m$ obtained by the described procedure are feasible. It holds $\mathbf{C}(k) \in \Phi_k$, $k = n, n - 1, \dots, m$ and $\mathbf{C}(m) \in \mathrm{Max}\,\Phi$.
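The agglomerative procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `constrained_agglomeration`, the single-linkage dissimilarity and the path-shaped relation in the example are hypothetical choices, not taken from the paper.

```python
from itertools import combinations

def constrained_agglomeration(units, d, fusible):
    """Greedy agglomeration: repeatedly fuse the closest fusible pair.

    units   -- iterable of hashable units (the set E)
    d       -- function (cluster, cluster) -> dissimilarity (assumed symmetric)
    fusible -- function (cluster, cluster) -> bool, the fusibility predicate psi
    Returns the list of clusterings C(n), C(n-1), ..., C(m).
    """
    clustering = [frozenset([u]) for u in units]        # step 1: singletons
    history = [list(clustering)]
    while True:
        # step 2: distinct clusters that are allowed to be fused
        pairs = [(a, b) for a, b in combinations(clustering, 2)
                 if fusible(a, b) or fusible(b, a)]
        if not pairs:
            return history                              # step 3: m := k
        a, b = min(pairs, key=lambda ab: d(*ab))        # step 2.1
        clustering = [c for c in clustering if c not in (a, b)] + [a | b]
        history.append(list(clustering))                # steps 2.2-2.3
        # step 2.4 is implicit: d is re-evaluated on demand

# Hypothetical example: three units on a path 1 - 2 - 3.
values = {1: 0.0, 2: 5.0, 3: 0.1}
edges = {(1, 2), (2, 3)}

def single_link(a, b):
    return min(abs(values[x] - values[y]) for x in a for y in b)

def linked(a, b):
    return any((x, y) in edges or (y, x) in edges for x in a for y in b)

history = constrained_agglomeration(values, single_link, linked)
```

In this example units 1 and 3 are the closest pair, but they are not related, so the first fusion joins 2 and 3 instead; this is the effect of the relational constraint.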



An agglomerative procedure is said to be compatible with $\Phi$ iff: every clustering obtained by the procedure is feasible, and every feasible clustering can be obtained by the procedure if we can at each step fuse any pair of "related" clusters. For our procedure it can be shown:

Theorem 3 If $\Phi$ satisfies the conditions F1 - F5, then the described procedure is compatible with $\Phi$.

4.2. The Relocation Algorithm 

The basic scheme for an adapted relocation algorithm is: Suppose that a reflexive and symmetric neighborhood relation $N \subseteq \Phi \times \Phi$ is given between feasible clusterings. Usually, for clustering problems, $N$ is determined by the following two transformations: moving a unit $X$ from cluster $C_p$ to cluster $C_q$ (transition); and interchanging units $X$ and $Y$ from different clusters $C_p$ and $C_q$ (transposition).

1. determine the initial feasible clustering $\mathbf{C}$;

2. while there exists $\mathbf{C}' \in N(\mathbf{C})$ such that $P(\mathbf{C}') < P(\mathbf{C})$ repeat:

2.1. move to $\mathbf{C}'$: $\mathbf{C} := \mathbf{C}'$.
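A minimal sketch of this relocation scheme, assuming a neighbourhood built from transitions only (transpositions are omitted for brevity). The names `neighbours`, `relocate` and `within_ss`, and the sum-of-squares criterion, are illustrative assumptions rather than the authors' implementation.

```python
def neighbours(clustering):
    """Yield every clustering reachable by moving one unit (a transition)."""
    for i, src in enumerate(clustering):
        if len(src) == 1:
            continue                      # keep the number of clusters fixed
        for unit in src:
            for j in range(len(clustering)):
                if j != i:
                    new = [set(c) for c in clustering]
                    new[i].remove(unit)
                    new[j].add(unit)
                    yield new

def relocate(clustering, P, feasible=lambda c: True):
    """Move to an improving feasible neighbour until none exists."""
    current = [set(c) for c in clustering]          # step 1
    while True:                                     # step 2
        better = next((c for c in neighbours(current)
                       if feasible(c) and P(c) < P(current)), None)
        if better is None:
            return current                          # local minimum of P
        current = better                            # step 2.1

def within_ss(clustering):
    """Hypothetical criterion: within-cluster sum of squared deviations."""
    return sum(sum((x - sum(c) / len(c)) ** 2 for x in c) for c in clustering)

result = relocate([{0.0, 0.1, 5.0}, {5.2}], within_ss)
```

The loop terminates because each accepted move strictly lowers the criterion and the number of clusterings is finite; the result is a local, not necessarily global, minimum.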



The following two features are crucial in developing an algorithm of this type: a method for randomly generating initial feasible clusterings; and an efficient procedure to scan the neighbourhood $N(\mathbf{C})$.

If the constraints are not too stringent, the relocation method can be applied directly on $\Phi$; otherwise, we can transform (penalty function method) the problem to an equivalent nonconstrained problem $(\Pi_k, Q, \min)$ with $Q(\mathbf{C}) = P(\mathbf{C}) + \alpha K(\mathbf{C})$ where $\alpha > 0$ is a large constant and

$$K(\mathbf{C}) = \begin{cases} 0 & \Phi(\mathbf{C}) \\ > 0 & \text{otherwise} \end{cases}$$

There exist several improvements of the basic relocation algorithm: simulated annealing, tabu search, ... (Aarts and Lenstra, 1997).

4.3. Dynamic Programming

Suppose that $\mathrm{Min}(\Phi_k, P) \ne \emptyset$. Denoting $P^*(E, k) = P(\mathbf{C}^*_k(E))$ we can derive the generalized Jensen equality (Batagelj, Korenjak and Klavzar, 1994):

$$P^*(E, k) = \begin{cases} p(E) & \{E\} \in \Phi_1(E) \\ \min\limits_{\substack{\emptyset \subset C \subset E \\ \exists \mathbf{C} \in \Phi_{k-1}(E \setminus C) : \mathbf{C} \cup \{C\} \in \Phi_k(E)}} \left(P^*(E \setminus C, k - 1) \oplus p(C)\right) & k > 1 \end{cases}$$

This is a dynamic programming (Bellman) equation which, for some special 
constrained problems, allows us to solve the clustering problem by the adapted 
Fisher’s algorithm. For an application of dynamic programming to hierarchies 
see Lebbe and Vignes (1996). 
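As an illustration of such a special case, the sketch below applies the Bellman recursion to units that are totally ordered, where every cluster must be a contiguous interval (the setting of Fisher-type algorithms). It assumes that $\oplus$ is ordinary addition; the function name `best_contiguous_partition` and the sum-of-squares cluster error `ss` are hypothetical.

```python
from functools import lru_cache

def best_contiguous_partition(x, k, p):
    """Minimise sum of p(cluster) over partitions of the sequence x
    into k non-empty contiguous clusters (requires k <= len(x)).
    Returns (optimal value, cut positions)."""
    @lru_cache(maxsize=None)
    def best(m, j):
        # optimum for the prefix x[:m] split into j clusters
        if j == 1:
            return p(x[:m]), ()
        candidates = []
        for cut in range(j - 1, m):            # last cluster is x[cut:m]
            value, cuts = best(cut, j - 1)
            candidates.append((value + p(x[cut:m]), cuts + (cut,)))
        return min(candidates)
    return best(len(x), k)

def ss(c):
    """Hypothetical cluster error: sum of squared deviations from the mean."""
    mu = sum(c) / len(c)
    return sum((v - mu) ** 2 for v in c)

value, cuts = best_contiguous_partition([1.0, 1.2, 5.0, 5.1, 9.0], 3, ss)
```

Here `cuts` gives the start indices of the second and later clusters, so the optimal partition of the example is [1.0, 1.2], [5.0, 5.1], [9.0].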

4.4. Multicriteria Clustering and Constraints 

In a multicriteria clustering problem $(\Phi, P_1, P_2, \dots, P_k, \min)$ we have several criterion functions $P_t$, $t = 1, \dots, k$ over the same set of feasible clusterings $\Phi$, and our aim is to determine the clustering $\mathbf{C} \in \Phi$ in such a way that $P_t(\mathbf{C}) \to \min$, $t = 1, \dots, k$.

In general, solutions minimal for distinct criteria will differ from each other. This raises the problem of how to find the ‘best’ solution so as to satisfy as many of the criteria as possible. In this context, it is useful to define the set of Pareto efficient clusterings: a clustering is Pareto efficient if it cannot be improved on any criterion without sacrificing on some other criterion. A multicriteria clustering problem can be approached in different ways (Ferligoj and Batagelj, 1992). It can also be solved by using constrained clustering algorithms where a selected criterion is considered as the clustering criterion and all other criteria determine the (optimizational) constraints. And conversely: a constrained clustering problem can be transformed to a multicriteria clustering problem by expressing the deviations from constraints by penalty functions.
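The notion of Pareto efficiency can be made concrete with a small sketch. It assumes that each candidate clustering has already been scored on all criteria (smaller is better); the function name `pareto_efficient` and the score vectors are illustrative only.

```python
def pareto_efficient(scores):
    """Return the indices of score vectors not dominated by any other.

    scores[i] is the tuple (P_1(C_i), ..., P_k(C_i)); smaller is better.
    """
    def dominates(u, v):
        # u is at least as good everywhere and strictly better somewhere
        return (all(a <= b for a, b in zip(u, v))
                and any(a < b for a, b in zip(u, v)))
    return [i for i, v in enumerate(scores)
            if not any(dominates(u, v) for u in scores)]

efficient = pareto_efficient([(1, 5), (2, 2), (3, 3), (5, 1)])
```

In this example the clustering scored (3, 3) is dominated by the one scored (2, 2), so only the other three are Pareto efficient.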

4.5. Other 

Some other optimizational approaches for solving constrained clustering prob- 
lems can be found in Klauer (1994) and Hansen, Jaumard, and Sanlaville (1994). 



5. Conclusion 

In the paper we presented an overview of constrained clustering problems viewed 
through ’optimizational’ glasses. For details the reader should consult the refer- 
ences. Some related papers are available at 

http://vlado.fmf.uni-lj.si/pub/cluster/

Acknowledgment: This work was supported by the Ministry of Science and Technology of Slovenia, Project J1-8532.

Abstract: A procedure for segmentation by a constrained hierarchical clustering algorithm is proposed, using a criterion (or response) variable X and k structural factors or predictors, which yields classes that differ mainly in the (conditional) distributions of X, computed within each segment. Since the procedure works on combinations of factor levels (and only indirectly on individuals), the methodology can be employed even for very large populations, with no increase of computational complexity.

Key words: Segmentation, structural factors, adjacency, constrained classification.



1. Introduction 

Several methods have been proposed (Morgan et al., 1963; Celeux et al., 1982; Breiman et al., 1984; Quinlan, 1986) for generating a segmentation of a population by growing classification trees. The goal of such methods is to investigate the underlying data structure and to explain the behavior of a response variable from a given set of explanatory variables. The search for an optimal tree leads to recursive algorithms, which can be very complex for large data sets and very dependent on the data (Aluja et al., 1996).

In this paper, a procedure for segmentation is proposed, based on an approach which employs constrained agglomerative cluster analysis methods. The goal is to partition the cells of the k-dimensional contingency table of the k predictor variables (referred to as structural factors) into contiguous classes of similar cells, according to a possibly multiple criterion or response variable. Such segments are structurally well-characterized, because cells which belong to each segment are contiguous and determine a “natural structure” (if any) to explain the distribution of the criterion variable.

The algorithm proposed here uses the dissimilarity between the conditional distributions of the criterion variable computed within each cell of the k-way contingency table.

The procedure is summarized by a tree diagram, which displays the whole hierarchical agglomerative process.

2. Notations and definitions

Let $k+1$ variables $(S_1, \dots, S_k, X)$ be observed in a population, where $S_i$ is the i-th structural factor and X is the variable with respect to which the classes will be discriminated (dependent or criterion variable). All the variables $S_i$ are supposed to be at least on an ordinal scale.

In the k-dimensional contingency table of $S_1, \dots, S_k$, let the h-th cell be denoted by $C_h$, $h \in \{1, \dots, p\}$, where p is the cardinality of the Cartesian product of the factor levels. Note that $C_h$ represents a combination of the levels of the k factors.

Let $D(X \mid C_h)$ be the conditional distribution of the criterion variable within each cell.

Two cells $C_h$ and $C_{h'}$ ($h, h' \in \{1, \dots, p\}$, $h \ne h'$) are adjacent if they differ by exactly one level of a single factor. From a geometrical point of view, the cells $C_h$ and $C_{h'}$ may be seen as hypercubes in a k-dimensional space and they are adjacent if they have one face (hypercube of dimension $k-1$) in common.

Consequently, an adjacency matrix A of a graph with vertices $\{C_h : h = 1, \dots, p\}$ can be computed as a $p \times p$ matrix with entries {1, 0}, depending on whether cells of the contingency table are adjacent or not.
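The construction of the adjacency matrix can be sketched as follows. The sketch assumes that levels are coded 1, 2, ... and that "adjacent" means the two cells differ in exactly one factor, by exactly one level (the face-sharing reading above); the function name `cell_adjacency` is hypothetical.

```python
from itertools import product

def cell_adjacency(levels):
    """Build the cells of the contingency table and their 0/1 adjacency matrix.

    levels[i] is the number of levels of factor S_(i+1).
    """
    cells = list(product(*(range(1, n + 1) for n in levels)))

    def adjacent(c, d):
        diffs = [abs(a - b) for a, b in zip(c, d)]
        # exactly one factor differs, and it differs by one level
        return sum(1 for v in diffs if v != 0) == 1 and sum(diffs) == 1

    A = [[1 if adjacent(c, d) else 0 for d in cells] for c in cells]
    return cells, A

# Three factors with 3, 4 and 3 levels give p = 36 cells.
cells, A = cell_adjacency([3, 4, 3])
```

A corner cell such as (1, 1, 1) has three adjacent cells, while an interior cell such as (2, 2, 2) has six.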

Notice that the above notion of adjacency might be modified in order to take into 
account additional statistical information on the structural factors, or different 
types of constraints on the cells. See Gordon (1996) on the construction of 
different adjacency matrices in classification. 

Let D be the matrix whose entries are the dissimilarities between all pairs of the conditional distributions $D(X \mid C_h)$, $h = 1, \dots, p$.

The dissimilarity between $C_h$ and $C_{h'}$ will be considered according to the level of measurement of X. When X is on a nominal scale, the chi-square distance or the following are considered in this paper:

[equation lost in extraction]

and when X is a categorized quantitative variable with r levels, a measure of dissimilarity between $C_h$ and $C_{h'}$ is:

[equation lost in extraction]

where $f$ and $F$ denote, respectively, the observed probability density function and the cumulative distribution function of $D(X \mid C_h)$ and $D(X \mid C_{h'})$, and $x_i$ and $x_{i+1}$ ($x_i < x_{i+1}$) are the end-points of the i-th interval of X.



3. Methodology 

3.1 The algorithm 

The aim of the procedure is to determine, on the basis of the $p \times p$ dissimilarity matrix D, subject to adjacency constraints of cells, disjoint classes of the combinations of factor levels, by applying a constrained aggregative cluster analysis algorithm.

In the first step of the algorithm, the two closest adjacent cells $C_h$ and $C_{h'}$ are merged to form a new class.

In the next step, dissimilarities between the merged cells and the other cells are updated, by considering for the new class the conditional distribution $D(X \mid C_h \cup C_{h'})$ and obtaining the new $(p-1) \times (p-1)$ dissimilarity matrix.

The binary adjacency matrix is also updated, according to the adjacency between the just-formed class and any other cell. The updated $(p-1) \times (p-1)$ matrix A can optionally be weighted to take into account the adjacency degree of each class, i.e., from a geometrical point of view, the number of hypercubes of dimension $k-1$ that are adjacent to each of the other classes.

In the next steps, the process of merging the two closest classes is repeated, until only one cluster is left, as in the usual aggregative hierarchical clustering. At the end of the process a segmentation of the population can be determined, by selecting one of the generated partitions.

The main steps of the algorithm are summarized in Scheme 1.

Scheme 1: Steps of the constrained aggregative algorithm

Step 0: (initialization) Let $k+1$ variables $(S_1, \dots, S_k, X)$ be given. Determine the k-dimensional contingency table C of $S_1 \times S_2 \times \dots \times S_k$; let p be the number of the cells of C.

Step 1: Compute the matrix $D = \{d_{hh'}\}$, where $d_{hh'}$ is the dissimilarity between $D(X \mid C_h)$ and $D(X \mid C_{h'})$.

Step 2: Compute the matrix $A = \{a_{hh'}\}$, where $a_{hh'}$ is an indicator variable for adjacency or an adjacency level.

Step 3: Merge the two closest adjacent cells or sets of cells to form a new class $C_h$.

Step 4: Within the set of cells $C_h$ compute the conditional distribution $D(X \mid C_h)$.

Step 5: Update the dissimilarity matrix D.

Step 6: Update the adjacency matrix A.

Repeat Step 3 to Step 6, p-1 times.
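The steps of Scheme 1 can be sketched as below. This is a simplified illustration under assumptions: dissimilarities and adjacencies are recomputed on demand rather than stored in the matrices D and A, two classes are treated as adjacent when any of their cells are, and the total variation distance `tv` is a hypothetical stand-in for the dissimilarity measures of Section 2.

```python
from itertools import combinations

def segment(counts, adjacent, dissimilarity):
    """counts: dict cell -> frequencies of the categories of X in that cell.
    Returns the list of merges (class_a, class_b, dissimilarity)."""
    classes = {frozenset([c]): list(f) for c, f in counts.items()}   # Step 0

    def profile(freq):                 # conditional distribution D(X | class)
        total = sum(freq)
        return [v / total for v in freq]

    merges = []
    while len(classes) > 1:
        best = None
        for a, b in combinations(classes, 2):
            if not any(adjacent(c, d) for c in a for d in b):        # Steps 2, 6
                continue
            dis = dissimilarity(profile(classes[a]), profile(classes[b]))  # Steps 1, 5
            if best is None or dis < best[0]:
                best = (dis, a, b)
        if best is None:
            break                      # no adjacent classes are left
        dis, a, b = best
        pooled = [u + v for u, v in zip(classes.pop(a), classes.pop(b))]
        classes[a | b] = pooled        # Steps 3-4: merge, pool the frequencies
        merges.append((set(a), set(b), dis))
    return merges

def tv(f, g):
    """Hypothetical dissimilarity: total variation distance."""
    return 0.5 * sum(abs(u - v) for u, v in zip(f, g))

# One ordinal factor with 4 levels, X with two categories.
counts = {1: [9, 1], 2: [8, 2], 3: [3, 7], 4: [1, 9]}
merges = segment(counts, lambda c, d: abs(c - d) == 1, tv)
```

Pooling the frequencies of the merged cells reproduces the conditional distribution of the new class, so the cost depends on the number of cells and not on the number of individuals.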



Obviously, it is possible to perform a non-hierarchical algorithm, if a hierarchical 
structure is not required and a partition in a prespecified number of classes is 
desired, or when the number of cells p is very large. 



3.2 The choice of the partition 

For any two consecutive steps, which correspond to the j- and (j-1)-segment solutions, it is possible to consider the arithmetic means of the corresponding D matrix entries and the ratio between such means (mean distance ratio), which indicates how much the average dissimilarity between segments increases (decreases), passing from the j- to the (j-1)-segment solution. Global or local peaks of these values, plotted against the number of segments, can give some indications on how to choose the segmentation, because they correspond to an increase of the dissimilarity between segments, which denotes a good separation between the elements of the partition.

Obviously, the choice of the number of segments is the well-known problem of identifying the most appropriate partition, by searching for a “stopping rule” to detect nested cluster structure, possibly at different levels of the hierarchical clustering (Milligan and Cooper, 1985).

4. An application 

The data were collected by an analysis of the choice of high school type, for 
children in families with different socio-economic features. For the sake of clarity, 
we recall that in Italy students, after their compulsory education, may choose 
among “Upper Secondary Schools” with different specializations 
(scientific/classical, technical, artistic, teacher training, professional training). 

A number (1227) of students attending the 1995/96 class of Psychometry at the University of Rome “La Sapienza” were asked for some information about their families. Table 1 displays the 3 factors recorded for each family and their levels.



Table 1

Structural factors                          Levels   Labels
S1   Family income                          3        1 = Low
                                                     2 = Medium
                                                     3 = High
S2   Family size                            4        1 = Up to 3 components
                                                     2 = 4 components
                                                     3 = 5 components
                                                     4 = More than 5 components
S3   Parents’ highest educational level     3        1 = Up to compulsory school
                                                     2 = Secondary school
                                                     3 = University



Table 2 shows the levels of the variable under study, which refers to each child of the family. The structure of the data set is appropriate to the proposed procedure, because the structural factors are observed for each family, while the criterion variable is recorded for each child within the family.

Table 2

Criterion variable                            Categories   Labels
X   Type of Secondary School attended by      9            x1 = Scientific Lyceum
    each child of the family                               x2 = Classical Lyceum
                                                           x3 = Foreign Languages Institute
                                                           x4 = Artistic Institute
                                                           x5 = Teacher Training Institute (for primary and nursery school-teachers)
                                                           x6 = Industrial Technical Institute
                                                           x7 = Professional Institute
                                                           x8 = Commercial and Business Institute
                                                           x9 = Agricultural and Other Technical Institute

For each combination of factor levels (i.e., each cell of the 3x4x3 contingency table of the 3 factors), the observed conditional distribution of the type of Secondary School attended by children is given as input to the procedure.

The technique described in the previous section is employed to determine socio- 
economic strata that can “explain” the choice of children’s Secondary School. 
Clearly, in order to investigate whether different socio-economic characteristics 
can be considered relevant to the choice of the children, the 3 structural factors 
considered in the example are not exhaustive and further analysis would be 
desirable, but this is beyond the purpose of this paper. 

Figure 1 shows a tree diagram of the whole hierarchical agglomerative process, 
which uses as ordinates for the various branches the cumulative levels of fusion 
distances (which are not monotonic because of the adjacency constraints). 



Figure 1 : Tree diagram 




The distances between conditional distributions are computed according to 
formula (1). 

Figure 2 shows the mean distance ratio with respect to the number of segments to 
give some indications on how to choose the segmentation. 

The 9- and 5-segment solutions seem to be significant, because they correspond respectively to a global and a local increase of the dissimilarity between segments, which denotes a good separation between the elements of the partition.

The 5-segment solution is given in Table 3, which displays the factor combinations or cells in each segment and the corresponding conditional distributions of the variable X.

The analysis of the solution reveals that the segments are characterized as follows:

• SEGMENT 1: medium income, medium family size, parents’ low educational level.

• SEGMENT 2: high income, medium family size, parents’ high educational level.

• SEGMENT 3: low income, medium family size, parents’ medium-high educational level.

• SEGMENT 4: low income, small family size, parents’ low educational level.

• SEGMENT 5: medium-high income, medium-large family size (with some small families).



Figure 2: Mean Distance Ratio Plot




Table 3: 5-segment solution: composition and distribution of the variable X

SEGMENT  FACTOR COMBINATIONS     x1    x2    x3    x4    x5    x6    x7    x8    x9
1        221                     0     0     0.33  0     0.33  0     0     0.33  0
2        323                     0.11  0.44  0.11  0     0     0     0.33  0     0
3        122 123 133             0     0.33  0     0     0     0     0     0.67  0
4        111 112 113 121 211     0     0     0     0.02  0.08  0.35  0.45  0 0   0
5        131 132 141 142 143     0.23  0.12  0.04  0.13  0.09  0.04  0.09  0.21  0.05
         212 213 222 223 231
         232 233 241 242 243
         311 312 313 321
         322 331 332 333 341
         342 343


The finer 9-segment solution reflects the same characterization because the only essential difference is in segment 5, which is separated into 4 subsegments: two formed by only one factor combination (331 and 143) and two other segments containing mostly small families with high income and medium-large families with medium income, respectively.

The procedure has also been carried out by using the chi-square distance between 
pairs of cells, with no significant differences in the results. 



5. Conclusions 

The proposed procedure determines a segmentation of a population which has well-defined structural characteristics, in the sense that different segments, each characterized by similar combinations of factor levels, differ mainly in the conditional distributions of the criterion variable.

This kind of segmentation is especially useful in situations where there is a large 
number of statistical units. In fact, the computations involved depend on the 
numbers of combinations of factor levels and not on the size of the population. 
Therefore, the procedure is particularly appropriate for very large populations (virtually infinite) as, for example, in market segmentation, in socio-economic stratification problems, or in epidemiological surveys, where it can be employed either as a preliminary analysis or as an effective tool of investigation.

Abstract: According to the idea of the predictive power of a class in a classification (Gilmour 1951), our aim is to build maximal predictive classifications (Gower 1974) of objects in a sequence, that respect the total order defined by this sequence. We propose a new dynamic programming algorithm able to discover a maximal predictive partition, whose complexity is linear in the length of the sequence and in the number of possible predictors. This algorithm accepts a vast range of predictor shapes and may be used for numerous possible applications. We present an experiment with this clustering algorithm on biological sequences.

Keywords: Optimal clustering, Maximal predictive classification, Maximal predictive partition, Constrained classification, Biological sequences.

1. Maximal predictive constrained classification 

Since Gilmour (Gilmour 1951) many investigators have suggested that classifications need to be evaluated with respect to their predictive power. This means that a class is all the more useful when the a priori knowledge that an object belongs to that class allows for the most numerous a posteriori predictions on the properties of this object. Following this principle, it is better to classify flowering plants by the number of stamens than by the number of leaves because knowing that a plant has three stamens allows one, for example, to predict that its leaves are likely to have parallel venation. On the contrary, few plant properties are correlated with the number of leaves. In many ways these ideas are largely promoted in biological applications of classification (Archie 1984; Colless 1986; Tassy 1991).

Gower (Gower 1974) was one of the first to give a precise operational 
meaning to these general guidelines. He proposes to measure the quality of a 
partition with the number of object properties correctly predicted by an optimal 
predictor assigned to each class of the partition. This is an additive criterion. 

Maximal predictive classification shares with conceptual clustering (Michalsky & Stepp 1983) the advantages of (1) proposing an explanation of the creation of classes (the optimal predictors) which facilitates the output interpretation; (2) relying on the quality of these explanations for the construction of the clustering.

But unlike conceptual clustering this method is not limited to the construction of monothetic classes (i.e. able to be characterised by a conjunction of properties).

Sometimes in classification, the set of objects to be classified is structured a priori and this structure must be preserved by the analysis. We speak of constrained classification (Gordon 1981). We note here that the aim of the clustering procedure is neither to evaluate the structure given on the objects nor to propose another structure; the structure must be considered strictly as a constraint to be preserved. A usual structure is the total order structure: for example, a geologist who wants to study a drilling core usually has to take the structure of the successive layers into account. In genetics, the study of proteins or genetic sequences must consider the sequential structure of these objects.

We have tried to build maximal predictive partitions of sequences where each 
object (having a specific position in the sequence) is described by a letter of an 
alphabet and where predictors are patterns of letters in the classes. To preserve 
the order of the sequence, the admissible classes are not all the subsets of 
positions but are restricted to those including only successive positions. 



2. The optimization problem 



If the sequence is made of letters of an alphabet A, our problem is, for a fixed k, to build a partition of the sequence in k classes (intervals) in such a way that each class of this partition is best predicted by a pattern of letters of A. We consider that a predictor predicts correctly at a position of the sequence if it generates the right letter at this position. We define a quality for classes and for partitions as the total number of good predictions.

2.1 For the sake of simplicity, we first consider the case where only letters of the alphabet A are the possible predictors of classes.

If our sequence is $S = o_1 \dots o_l$ ($o_r \in A$), for each class $c = o_i \dots o_j$ we define the quality $\pi(a, c)$ of the prediction of $a \in A$ on $c$ as the number of letters of $c$ correctly predicted by $a$:

$$\pi(a, c) = \#\{r \in [i, j] : o_r = a\} = \sum_{r=i}^{j} 1_{o_r = a}$$

where # is the cardinal, $1_{\text{true}} = 1$ and $1_{\text{false}} = 0$.



The optimal predictors of c are the letters a that reach the maximal quality, i.e. the most frequent letters. Thus, we define the quality of c as:

$$\pi(c) = \max_{a \in A} \pi(a, c)$$

This quality measures the adequacy between a class and its optimal predictors. A class redescription is constructed by copying an optimal predictor along the full length of that class. For example, with A = {a, c, g, t}, the quality of c = attgt is 3, its optimal predictor is t and the corresponding redescription of c is ttttt.

For a partition $p = (c_1, \dots, c_n)$ of S into classes, the quality of p will be:

$$\pi(p) = \sum_{i=1}^{n} \pi(c_i)$$

that is, the total number of letters of the sequence S correctly predicted when each class has a letter allocated as optimal predictor. This is in accordance with the measure proposed by Gower. The concatenation of the redescriptions of all the classes of a partition gives a redescription of this partition. For example, if p = (attgt, acc) with the same alphabet A as before, its quality is 3+2=5 and its corresponding redescription is tttttccc.
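The two definitions above translate directly into code. The following sketch uses hypothetical function names (`class_quality`, `partition_quality`) and represents classes as Python strings; when several letters are equally frequent it simply keeps the first one encountered.

```python
from collections import Counter

def class_quality(c):
    """Quality and an optimal predictor of class c (single-letter predictors):
    the count of the most frequent letter, and that letter."""
    letter, count = Counter(c).most_common(1)[0]
    return count, letter

def partition_quality(classes):
    """Quality and redescription of a partition given as a list of classes."""
    quality = sum(class_quality(c)[0] for c in classes)
    redescription = "".join(class_quality(c)[1] * len(c) for c in classes)
    return quality, redescription

q_class = class_quality("attgt")
q_partition = partition_quality(["attgt", "acc"])
```

Applied to the example of the text, the class attgt has quality 3 with predictor t, and the partition (attgt, acc) has quality 5 and redescription tttttccc.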

2.2 If we use words as patterns of prediction, the quality $\pi$ of the prediction of a word on a class will be the number of correct letters in the redescription of the class by this word. For example, if the word is at and the class is attgat, the redescription is atatat and the quality is 4. Using words makes it possible to obtain more precise redescriptions.
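A short sketch of this word-based quality follows. It assumes, as in the example just given, that the repetition of the word starts at the first position of the class; the function name `word_quality` is hypothetical.

```python
def word_quality(word, c):
    """Redescribe class c by repeating word from its first position and
    count the positions where the redescription matches c."""
    redescription = (word * (len(c) // len(word) + 1))[:len(c)]
    quality = sum(x == y for x, y in zip(redescription, c))
    return quality, redescription

result = word_quality("at", "attgat")
```

For the word at and the class attgat this gives the redescription atatat with quality 4, as stated above.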



3. A linear algorithm 

For a fixed k, we want to optimize this quality function $\pi$ on the set of partitions in k classes of the sequence.

3.1 First, we present our algorithm in the simplest case where predictors are 
single letters. 

We use two properties of maximal predictive clustering : 

• In a partition, if two letters of the sequence are in the same class, they are 
compared to the same predictor; 

• The quality of a partition is the sum of the qualities of its classes, therefore 
any subset of the set of the classes of a maximal partition is a maximal 
partition of the concatenation of the elements of this subset. 

In a sequence $S = o_1 \dots o_l$, we define for any $m \le l$, $d_m = o_1 \dots o_m$.

To get the maximal predictive partitions in k classes of S, we will successively compute all the maximal predictive partitions in 1, 2, 3, ..., k classes of $d_m$, for all m from 1 to l.

We show the recurrent steps of our dynamic programming algorithm. Let us call $P^n[m]$ the quality of the maximal predictive partitions in n classes of $d_m$. $P^k[l]$ is the final result we want to reach.

• First, the $(P^1[m])_{1 \le m \le l}$ are the qualities of the corresponding classes $(d_m)_{1 \le m \le l}$, easily computable.

• Recursively, having the series $(P^{n-1}[m])_m$, we try to get $(P^n[m])_m$.

For all the letters a in A, we note for a partition $p = (c_1, \dots, c_n)$:

$$\pi_a(p) = \sum_{i=1}^{n-1} \pi(c_i) + \pi(a, c_n)$$

and for all $m \le l$, $P_a^n[m]$ is the maximal quality $\pi_a(p)$ among the set of the partitions p in n classes of $d_m$. These partitions reaching $P_a^n[m]$ are called a-restrained maximal predictive partitions. We have:

$$P^n[m] = \max_{a \in A} P_a^n[m]$$

Now, we must get $P_a^n[m]$. In an a-restrained maximal predictive partition p, the predictor of the class owning $o_m$ is a and:

• either $o_m$ is the only element of its class and so this partition is a maximal predictive partition in $n-1$ classes of $o_1 \dots o_{m-1}$ plus $o_m$ alone predicted by a; so its quality is: $P^{n-1}[m-1] + 1_{o_m = a}$

• or $o_m$ is in the same class as $o_{m-1}$ and, a being the predictor of this class, that partition p is an a-restrained maximal predictive partition in n classes of $o_1 \dots o_{m-1}$ with the class owning $o_{m-1}$ extended to own $o_m$. So its quality is: $P_a^n[m-1] + 1_{o_m = a}$

And, finally, $P_a^n[m]$ being the best of the two possibilities (see figure 1):

$$P_a^n[m] = 1_{o_m = a} + \max\left(P^{n-1}[m-1],\ P_a^n[m-1]\right)$$

Then, having $(P^{n-1}[m])_m$, we successively compute $(P_a^n[m])_m$ for all letters a, and at last $(P^n[m])_m$. At the end of this process, we get $P^k[l]$.

It is clear that this algorithm yields all the maximal predictive partitions in n classes for all $d_m$ if we have all the maximal predictive partitions in $n-1$ classes for all $d_m$. So we get all the optimal partitions, i.e. maximally predictive in our case, of S in n classes for all n from 1 to k successively.

For each $a \in A$, it takes time $O(l)$ to compute $(P_a^n[m])_m$ from $(P^{n-1}[m])_m$ and, as we need all the #A series $(P_a^n[m])_m$ to compute $P^n[m]$, the computation time from $(P^{n-1}[m])_m$ to $(P^n[m])_m$ is $O(l \cdot \#A)$. So this algorithm has a time-complexity linear in the length of the sequence, the size of the alphabet and the desired number of classes: $O(k \cdot l \cdot \#A)$.
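The recurrence for single-letter predictors can be sketched as follows. The function name `max_predictive_quality` is hypothetical, impossible cases (more classes than letters) are represented by minus infinity, and only the optimal quality is returned, not the partitions themselves.

```python
def max_predictive_quality(seq, k, alphabet):
    """Quality P^k[l] of a maximal predictive partition of seq into k
    intervals with single-letter predictors, in O(k * l * #A) time."""
    l = len(seq)
    NEG = float("-inf")
    prev = [NEG] * (l + 1)        # prev[m] plays the role of P^(n-1)[m]
    prev[0] = 0                   # the empty prefix split into 0 classes
    for n in range(1, k + 1):
        cur = [NEG] * (l + 1)                    # cur[m] = P^n[m]
        restrained = {a: NEG for a in alphabet}  # P_a^n[m-1]
        for m in range(1, l + 1):
            for a in alphabet:
                hit = 1 if seq[m - 1] == a else 0
                # start a new class at o_m, or extend the class predicted by a
                restrained[a] = hit + max(prev[m - 1], restrained[a])
            cur[m] = max(restrained.values())
        prev = cur
    return prev[l]

q1 = max_predictive_quality("attgtacc", 1, "acgt")
q2 = max_predictive_quality("attgtacc", 2, "acgt")
q3 = max_predictive_quality("attgtacc", 3, "acgt")
```

For the sequence attgtacc, two classes reach quality 5 (for instance attgt and acc), in agreement with the example of Section 2.1. Recovering the partitions would additionally require storing, for each state, which of the two possibilities achieved the maximum.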



Figure 1: working of the algorithm




An essential feature of this algorithm is that for each possible predictor it handles variables bound to this predictor (the $P_a^n$ with a a predictor). So, the set of all the possible predictors must be defined from the start of the computation. Here, this set was the alphabet A.

3.2 With a small change, the same algorithm is able to handle words as patterns of prediction. On the class c = atacgta, with at as predictor, there are two possible redescriptions, atatata and tatatat, which have different qualities. So we must consider the phase of the repeated word in the class. To do that, we can fix the phase of a predictor in a class in such a way that the last object of the class is predicted by the last letter of the pattern. On the same example c, with the predictor act, the redescription of c is then tactact. Moreover, we fix the set of all the possible predictors as the set of the circular permutations of the words in consideration. For example, for the word act, the possible predictors taken into account will be: act, cta and tac.

Now, if is a possible predictor, we define Pa^_a,\pA before with 

a\..Mr used as the predictor of the last class. It is easy to check that the former 
algorithm still works with the new formula: 



K..a, M f + max(p"-' [m - l], [m - 1]) 



The time-complexity grows naturally with the size of the set of possible
predictors. If we want to make maximal predictive partitions with words of
length $\lambda$ on alphabet A, then the complexity will be $O(k \cdot l \cdot \#A^{\lambda})$. We can also use
words of length 2 and 3 together as possible predictors, in which case the
time-complexity is $O(k \cdot l \cdot \#A^3)$.
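The word version of the recurrence can be sketched as follows. This is a hedged illustration, not the authors' program: the function name `max_predictive_quality_words` and the parameter `word_len` are assumptions, and the predictor set is taken to be all words of one fixed length (a set that is closed under circular permutation). The predictor attached to position m is the word whose last letter predicts $o_m$; its predecessor at position m-1 is the rotation that ends one letter earlier.

```python
from itertools import product

def max_predictive_quality_words(seq, k, alphabet, word_len):
    """Best quality of a partition of seq into k classes, each class being
    redescribed by a repeated word of length word_len whose last letter
    falls on the last object of the class."""
    NEG = float("-inf")
    l = len(seq)
    words = ["".join(w) for w in product(alphabet, repeat=word_len)]
    best_prev = [0] + [NEG] * l                       # P^{n-1}[m]
    for n in range(1, k + 1):
        cur = {w: [NEG] * (l + 1) for w in words}     # P^n_w[m]
        best = [NEG] * (l + 1)
        for m in range(1, l + 1):
            for w in words:
                hit = 1 if seq[m - 1] == w[-1] else 0
                rotated = w[-1] + w[:-1]              # a_r a_1 ... a_{r-1}
                cur[w][m] = hit + max(best_prev[m - 1], cur[rotated][m - 1])
            best[m] = max(cur[w][m] for w in words)
        best_prev = best
    return best_prev[l]

for r in (1, 2, 3):
    print(r, max_predictive_quality_words("agcgtagtagttccc", 3, "acgt", r))
```

With word lengths 1, 2 and 3 the sketch gives the qualities 9, 10 and 14 quoted in the discussion for the sequence agcgtagtagttccc in 3 classes. The table has one row per word, which is where the factor $\#A^{\lambda}$ in the complexity comes from.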

4. Discussion 

An algorithm based on Fisher's work (Fisher 1958) has already been applied to
constrained predictive maximal clustering (Lebbe & Vignes 1994a, 1994b).
This algorithm has a complexity quadratic in the length of the
sequence but, unlike ours, is independent of the cardinality of the set of
predictors, and so it can be used with more general measures of homogeneity of a class.

On qualitative sequences, prediction measurement is a valuable homogeneity 
criterion of a class. For instance, with predictors restricted to be letters, low 
prediction is observed when all the letters have approximately the same 
frequency and high prediction when a letter is strongly predominant. In the case 
of more complex predictors it can also be interpreted as a more general
homogeneity criterion. Moreover, because, unlike prediction is roughly
proportional to the size of the class, maximal predictive clustering has the
advantage of not constraining the size of the classes in an optimal partition. The
main drawback of this criterion is its potential ability to generate several
equivalent partitions but, in practice, this number is a pertinent indicator of a bad
value of k.

Another interest of the maximal predictive classification on sequences is the 
ability to produce a redescription of the original sequence which is a new 
optimally smoothed sequence. Moreover, the smoothing is, among all the 
sequences with the same length and made of k runs generated from the allowed 
predictors, the most similar to the initial one according to Hamming's distance. 
For example, if our sequence on A is S = agcgtagtagttccc, if we look for the
maximal predictive classification in 3 classes of S, and the set of predictors is A,
the optimal quality is 9 and a resulting redescription is ggggttttttttccc. This
redescription is an optimal smoothing of S according to this class specification.
If the set of predictors is $A^2$, i.e. the 2-letter words on A, the optimal quality is
10 and it is reached with the smoothing: agagttttttttccc. If the set of
predictors is $A^3$, the optimal quality is 14 and it is reached with the smoothing:
agcgtagtagtcccc.

The smoothing gives a view of the sequences by discarding the events with
high frequency which blur large structures. It is interesting because the scale
factor is given by the number of classes in the partition and not by the size of a
moving window, which is difficult to set in a non-arbitrary way. Moreover, the scale
factor varies naturally from place to place according to the local frequency
variation of the events, and the method can deal not only with smoothings made
of repetitions of letters but also with more intricate ones formed by
repetitions of words. Last but not least, the method chooses among
all the possible smoothings an optimal one with clearly defined properties.

In our implementation of the algorithm, without changing its principles, we 
can deal with ambiguous predictors like {a, t} where a letter is considered 
predicted if it is a or t. We are also able to specify different successive sets of 
predictors along the successive classes of the partitions. Moreover, as in practice
prediction errors do not necessarily have the same consequences, we can treat
different functions of letter prediction by replacing the function 1 by another
one. Finally, we consider the combination of the maximal predictive
classification with a translation in a new alphabet to obtain both a partition and a 
characterization of the classes not only in terms of their objects but in terms of 
properties of their objects. In order to cover all these possibilities we have 
conceived a simple language for specifying the clustering problem to be solved ; 
each problem being specified by its set of acceptable predictors the complexity 
of the algorithm increases only linearly with the size of this set of predictors. 



5. Applications 

Because the complexity of our algorithm is linear in the length of the
sequence, it makes clustering on large sequences far easier to study than former
algorithms do. In particular, more and more large genetic
sequences (> 10^ letters) are now being found and need to be studied. Our algorithm may
then be used to find the structure of complete genomes.

We demonstrated our algorithm on the complete genome of bacteriophage
lambda, which is 48,502 bp long. The possible predictors are the 4 letters a, c,
g and t. We stop the partitioning when we have 5 classes (Figure 2). Each class
is represented by an arch marked above with its optimal predictor. In this 
application we notice that the five partitions which are constructed together but 
independently optimized define a clear hierarchy. This behavior which is quite 
common in practice, even with more complex predictors, will be the object of 
further studies. 

In figure 2, the sequence is pictured by the lower horizontal line and the
arrows represent the zones in which the genes are read in the direction shown by
the arrows. Where there is no arrow, there are no genes in this part.
We notice that the five classes found by our program match these different zones
and that their limits are rediscovered with an accuracy better than 250 bp, confirming
the practical interest of the method.

Abstract: Clustering problems with relational constraints in which the
underlying graph is a tree arise in a variety of applications: hierarchical data base 
paging, communication and distribution network districting, biological 
taxonomy, and others. They are formulated here as optimal tree partitioning 
problems. In a previous paper, it was shown that their computational complexity 
strongly depends on the nature of the objective function and, in particular, that 
minimizing the total within-cluster dissimilarity or the diameter is 
computationally hard. We propose a heuristic which finds good partitions for the 
first problem within a reasonable time, even when its size is large. Such heuristic 
is based on the solution of a linear program and a maximal network flow one, 
and in any case it yields an explicit estimate of the relative approximation error. 
With minor variations a similar approach yields good solutions for the minimum 
diameter problem. 

Key words: Contiguity-constrained clustering, trees, heuristics, computational 
complexity, linear programming, network flows. 

1. Introduction 

In many classification studies, given a set V of objects, one looks for clusters of 
V which are subject to constraints of different kinds. In particular, a special class 
of constraints has been studied by several authors, with slight differences from 
each other and under different names: conditional clustering (Lefkovitch, 1980), 
relational constraints (Ferligoj and Batagelj, 1983), contiguity constraints 
(Murtagh, 1985), connectivity constraints (Hansen et al., 1993).

For example, if the objects to be classified are counties in a region, one usually 
requires that each cluster be formed by geographically contiguous counties. If
the objects are records in a relational data base, every record should be
relationally accessible from any other record in the same cluster; and so on.

In this paper we shall consider a reflexive and symmetric binary relation R in V.
Such a relation can be effectively represented by an undirected graph G (for the
graph-theoretic terminology, cf. Lovasz, 1979). The node-set of G is the set V of
all objects, which may always be taken as the standard set $\{1, \ldots, n\}$; the set E of
edges of G contains all pairs (2-element sets) $(i,j)$ such that $iRj$. A partition
$\pi = \{C_1, \ldots, C_p\}$ of V is declared to be feasible if, for each $k = 1, 2, \ldots, p$, the
subgraph $G(C_k)$ induced by cluster $C_k$ is connected. The set of all feasible
partitions of V into p clusters will be denoted by $\Pi_p(G)$.

In the present work we consider the particular case when the graph is a tree
$T = (V,E)$. Clustering problems on trees arise in a variety of applications. Tree
structures are featured in hierarchical data bases, in many communication or
distribution networks (LANs, pipelines, etc.), in coastal highways, in
phylogenetic trees and so on; and in many applications it makes sense to require
that the clusters correspond to subtrees of the tree under consideration. In the
case of a tree, by “cutting” (deleting) $p-1$ edges, one obtains p subtrees, whose
node-sets are the p clusters of a feasible partition. Conversely, all feasible
partitions into p clusters arise in this way.
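The correspondence between cut edges and feasible partitions can be illustrated with a short sketch. The function name `clusters_after_cut` and the 6-node tree are hypothetical and serve only as an example; nodes are numbered 1..n as in the text.

```python
def clusters_after_cut(n, edges, cut):
    """Node sets of the subtrees left after deleting the edges in cut."""
    cut = {frozenset(e) for e in cut}
    adj = {v: [] for v in range(1, n + 1)}
    for i, j in edges:
        if frozenset((i, j)) not in cut:      # keep only the uncut edges
            adj[i].append(j)
            adj[j].append(i)
    seen, clusters = set(), []
    for start in range(1, n + 1):
        if start in seen:
            continue
        comp, stack = [], [start]             # depth-first search of one subtree
        seen.add(start)
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        clusters.append(sorted(comp))
    return clusters

# Hypothetical tree: path 1-2-3-4 with leaves 5 and 6 attached to node 3.
tree = [(1, 2), (2, 3), (3, 4), (3, 5), (3, 6)]
print(clusters_after_cut(6, tree, [(2, 3), (3, 5)]))
```

Deleting 2 edges leaves 3 connected clusters, here {1, 2}, {3, 4, 6} and {5}.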

As other authors do (see, e.g., Rao, 1971; Mulvey and Crowder, 1979; Delattre
and Hansen, 1980), we view clustering as a combinatorial optimization problem:
with any partition $\pi$ of V one can associate an indicator $f(\pi)$ of homogeneity
within clusters or separation between clusters. For example, given a dissimilarity
matrix $D = [d_{ij}]$, where $d_{ij} \ge 0$ for each pair $(i,j)$ of objects, $d_{ii} = 0$ for all $i$ and
$d_{ij} = d_{ji}$ for all $(i,j)$, two common indicators of homogeneity are the inner
dissimilarity (or total within-cluster dissimilarity) and the diameter,

$ind(\pi) = \sum_{k=1}^{p} \sum_{i,j \in C_k} d_{ij}$   (1.1)

$diam(\pi) = \max_{k=1,\ldots,p} \ \max_{i,j \in C_k} d_{ij}$   (1.2)

respectively. Notice that minimizing the inner dissimilarity is equivalent to
maximizing the outer dissimilarity. Then the general clustering problem with
relational constraints can be formulated as follows: find $\pi^* \in \Pi_p(G)$ such that

$f(\pi^*) = \min \ (\text{or } \max) \ \{ f(\pi) : \pi \in \Pi_p(G) \}$   (1.3)

In a previous paper (Maravalle, Simeone and Naldini, 1997), the computational
complexity of problem (1.3) for trees has been investigated*. It was shown that
minimizing the inner dissimilarity or the diameter is NP-hard and thus, unless the
widely supported $P \neq NP$ conjecture is false, problem (1.3) can be solved only
by enumerative algorithms with exponential running time. This is unfortunately
true even for trivial trees such as stars! On the other hand a statistician, who in
real-life applications has to deal with trees having at least a few hundred
nodes (objects), is probably interested in finding just a good feasible partition,
perhaps not the best one, within a reasonable time. So he or she would be
perfectly satisfied with a heuristic algorithm having the following two properties:
i) it is fast; ii) an estimate of the percentage error is made available, either before
or after the execution of the algorithm. In the former case one speaks of an a
priori error bound, in the latter of an a posteriori bound. Of course, for the
heuristic to be of practical use, the percentage error should be relatively small.
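To make the enumerative alternative concrete, here is a minimal brute-force sketch that tries every set of p-1 cut edges and keeps the one with the smallest inner dissimilarity. It is an illustration only: the function names, the path-shaped tree and the dissimilarities $d_{ij} = |i - j|$ are assumptions. The number of candidate cut sets is the binomial coefficient of n-1 over p-1, which is why this approach does not scale.

```python
from itertools import combinations

def cluster_labels(n, kept_edges):
    """Union-find labels: nodes with equal labels are in the same cluster."""
    parent = list(range(n + 1))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]     # path halving
            v = parent[v]
        return v
    for i, j in kept_edges:
        parent[find(i)] = find(j)
    return [find(v) for v in range(n + 1)]

def inner_dissimilarity(n, kept_edges, d):
    """ind(pi): sum of d[i][j] over unordered pairs lying in one cluster."""
    label = cluster_labels(n, kept_edges)
    return sum(d[i][j] for i, j in combinations(range(1, n + 1), 2)
               if label[i] == label[j])

def best_partition_bruteforce(n, edges, d, p):
    """Enumerate all ways of cutting p-1 edges; returns (value, cut edges)."""
    best = None
    for cut in combinations(edges, p - 1):
        kept = [e for e in edges if e not in cut]
        value = inner_dissimilarity(n, kept, d)
        if best is None or value < best[0]:
            best = (value, cut)
    return best

# Hypothetical instance: path 1-2-3-4 with d_ij = |i - j| (index 0 unused).
path = [(1, 2), (2, 3), (3, 4)]
d = [[abs(i - j) for j in range(5)] for i in range(5)]
print(best_partition_bruteforce(4, path, d, 2))
```

On this instance the middle edge is cut, giving the clusters {1, 2} and {3, 4} with inner dissimilarity 2.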

As an example of an a priori error bound, in the above mentioned paper it was
proved that the greedy algorithm outputs a final partition $\pi$ whose outer
dissimilarity, in the worst case, is within about 37% of the maximum one.

In Section 2 of the present paper, we formulate the problem of minimizing
$ind(\pi)$ over $\Pi_p(T)$ as an integer linear program, for whose solution we describe
a heuristic consisting of at most three stages. Stage 1 consists in the (exact or
approximate) solution of a (continuous) linear program, which is described in
Section 3. Some optimality properties for this linear program are given in
Section 4. Stage 2 and Stage 3, described in Section 5, consist in the solution of
a maximal flow problem in a suitable network and in a “cut adjustment”
procedure, respectively. Such a heuristic is fast enough to solve practical
problems of relatively large size. After the execution of the heuristic, an upper
bound on the relative approximation error is available. The heuristic has the
interesting feature that, although it does not guarantee the optimality of the final
partition, optimality does hold if it stops in Stage 1 or, in some cases, in Stage 2.
While other heuristics (such as the greedy or the interchange ones) perform a
local search and thus have a myopic nature, our heuristic takes a global
approach, since it is based on the global solution of some optimization problem
approximating the given one, $\min_{\pi \in \Pi_p(T)} ind(\pi)$.

On account of space, we will omit the proofs of all the theorems. For a full 
version of the results presented in this paper, including all the proofs, the reader 
is referred to Lari et al., 1998. 

2. Formulation of the problem as an integer linear program 

Given a tree $T = (V,E)$ with $V = \{1, \ldots, n\}$, an integer $p$, $1 < p < n$ and, for each
pair of nodes $i$ and $j$ in V, a dissimilarity index $d_{ij}$ satisfying the above
restrictions, we shall consider the tree clustering problem, where the objective
function is (1.1). As mentioned above, the latter problem is NP-hard also for
stars. In the following we will indicate with A the set of all unordered pairs $(i,j)$
of elements of V. Moreover, for any $(i,j) \in A$, let $P_{ij}$ be the unique path between
$i$ and $j$ and let $l_{ij}$ be the number of edges of $P_{ij}$. For any ordered pair of nodes $i$
and $j$ ($i \neq j$), let $h(i,j)$ be the node adjacent to $j$ in the path $P_{ij}$.

* For a comprehensive exposition of complexity theory, the reader is referred to Garey and
Johnson, 1979.

We define for each $(i,j) \in A$ the following variable:

$x_{ij} = \begin{cases} 1 & \text{if } i \text{ and } j \text{ belong to the same cluster} \\ 0 & \text{otherwise} \end{cases}$

Note that if $(i,j) \in E$, i.e. $(i,j)$ is an edge, then one has $x_{ij} = 1$ if $(i,j)$ is not cut
and $x_{ij} = 0$ if $(i,j)$ is cut. In order to find the constraints of the model we first
impose the cardinality constraint on the number of edges that are cut:

$\sum_{(i,j) \in E} x_{ij} = n - p$   (cardinality)

Now we have to impose that the clusters be connected. For each $(i,j) \in A - E$ the
following statement must hold:

$\exists\, (h,k) \in P_{ij}$ such that $x_{hk} = 0 \iff x_{ij} = 0$.

The implications $\Rightarrow$ are equivalent to the following constraints:

$x_{ij} \le x_{hk}$,  $(i,j) \in A - E$,  $(h,k) \in P_{ij}$   (order)

The reverse implications $\Leftarrow$ are equivalent to the following constraints:

$\sum_{(h,k) \in P_{ij}} x_{hk} - x_{ij} \le l_{ij} - 1$,  $(i,j) \in A - E$   (closure)

Finally, by definition, the variables are subject to the integrality constraints and
to the following constraints:

$0 \le x_{ij} \le 1$,  $(i,j) \in A$   (unit-interval)

The objective function is:

$\sum_{(i,j) \in A} d_{ij} x_{ij}$

We obtain the following formulation:

$v^* = \min \sum_{(i,j) \in A} d_{ij} x_{ij}$
s.t. [cardinality, order, closure, unit-interval, integrality]   (Inner)
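As a sanity check on the model, the following sketch builds the 0-1 vector x induced by a set of cut edges and tests the cardinality, order and closure constraints directly. The helper names and the 6-node tree are hypothetical; edges and pairs are stored as sorted tuples $(i,j)$ with $i < j$.

```python
def tree_paths(n, edges):
    """paths[(i, j)], i < j: edges (sorted tuples) on the unique path P_ij."""
    adj = {v: [] for v in range(1, n + 1)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    paths = {}
    for i in range(1, n + 1):
        parent = {i: None}
        order = [i]
        for v in order:                       # breadth-first search rooted at i
            for w in adj[v]:
                if w not in parent:
                    parent[w] = v
                    order.append(w)
        for j in range(i + 1, n + 1):
            path, v = [], j
            while parent[v] is not None:      # walk from j back to i
                path.append(tuple(sorted((v, parent[v]))))
                v = parent[v]
            paths[(i, j)] = path
    return paths

def x_from_cut(paths, cut):
    """x_ij = 1 iff no edge of P_ij is cut, i.e. i and j share a cluster."""
    cut = {tuple(sorted(e)) for e in cut}
    return {pair: int(all(e not in cut for e in path))
            for pair, path in paths.items()}

def check_inner(n, p, paths, x):
    """Truth values of the (cardinality, order, closure) constraints for x."""
    edge_pairs = [pair for pair, path in paths.items() if len(path) == 1]
    cardinality = sum(x[e] for e in edge_pairs) == n - p
    order = all(x[pair] <= x[e] for pair, path in paths.items() for e in path)
    closure = all(sum(x[e] for e in path) - x[pair] <= len(path) - 1
                  for pair, path in paths.items())
    return cardinality, order, closure

tree = [(1, 2), (2, 3), (3, 4), (3, 5), (3, 6)]   # hypothetical tree, n = 6
paths = tree_paths(6, tree)
x = x_from_cut(paths, [(2, 3), (3, 5)])            # p = 3 clusters
print(check_inner(6, 3, paths, x))
```

Any x induced by cutting p-1 edges passes all three checks. Setting a non-edge variable such as `x[(4, 6)]` to 0 while both edges of its path stay uncut violates the closure constraint, which is exactly the case closure is meant to exclude.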



This model has $n(n-1)/2$ variables and $O(n^3)$ constraints. In fact there is a
variable for each pair $(i,j)$ in A. Moreover, for each $(i,j) \in A$ we have $O(n)$ order
constraints, hence the total number of such constraints is $O(n^3)$. Finally we have
a closure constraint for each $(i,j) \in A - E$; then the total number of such
constraints is $O(n^2)$.

The following theorems allow us to find a more concise formulation of Inner.

Theorem 1 - If all the variables $x_{ij}$ are binary, then the closure and the order
constraints imply the following $O(n^2)$ constraints:

$x_{ij} \le x_{i\,h(i,j)}$,  $x_{ij} \le x_{h(i,j)\,j}$,  $1 \le i < j \le n$,  $(i,j) \notin E$   (s-order)

$x_{i\,h(i,j)} + x_{h(i,j)\,j} - x_{ij} \le 1$,  $1 \le i < j \le n$,  $(i,j) \notin E$   (transitivity)

Theorem 2 - If $0 \le x_{ij} \le 1$ for all pairs $(i,j)$, then the s-order constraints imply the
order constraints and the transitivity constraints imply the closure constraints.

By Theorems 1 and 2 the following model is equivalent to Inner.

$v^* = \min \sum_{(i,j) \in A} d_{ij} x_{ij}$
s.t. [cardinality, s-order, transitivity, unit-interval, integrality]   (S-Inner)

This model has the same variables as Inner, and $O(n^2)$ constraints. Moreover, all
its constraints involve at most three variables.



3. The continuous relaxation 

Consider the general integer programming problem:

$\min \{ f(x) : g(x) \le 0,\ x \in Z^n \}$,   (P)

where $g : R^n \to R^m$, $f : R^n \to R$, R and Z being the sets of reals and integers,
respectively. The continuous relaxation of P (denoted by $\bar{P}$) is the (continuous)
problem obtained by dropping the integrality constraints (this means that
$x \in R^n$).

Let us consider the linear program $\overline{\text{S-Inner}}$ (from now on denoted LP for
short) given by the continuous relaxation of S-Inner. The optimal value $v_{LP}$ of
LP gives a lower bound of the minimum inner dissimilarity; moreover, if the
optimal solution is integral, then it is the optimal solution to S-Inner. Notice
that, under the assumption that all dissimilarities are strictly positive, $v_{LP}$ is also
strictly positive in view of the cardinality constraint. If a linear program has only
order constraints, then all its basic feasible solutions are integral (Picard, 1976);
moreover Marcotorchino and Michaud, 1979, report that linear programming
problems involving only transitivity constraints and where all variables take
values in the interval [0,1], very frequently have integral solutions. Hence there
is some hope that, in a non-negligible number of cases, the linear program LP
admits an integral optimal solution, or, at least, that the percentage of fractional
variables is small. In the former case, one has $v_{LP} = v^*$; in the latter one, it is
expected that usually the lower bound on $v^*$ given by $v_{LP}$ is rather tight. In a
very preliminary set of experiments, we solved LP for some small randomly
generated test problems to see how many times non-integral solutions are
obtained. In 450 examples with $n \le 13$, we obtained non-integral solutions in
about 30% of the cases and the average number of non-integral variables in the
non-integral cases was about 15% of the total number of variables. Moreover,
the percentage of problems with non-integral solutions increases when p
approaches n/2. Of course, a much broader experimentation, involving larger
test problems, is needed in order to reach some definitive conclusions.

Remark 1 - The optimal value of Inner is equal to the minimum inner
dissimilarity of a partition $\pi \in \Pi_p(T)$, and hence it follows that the cardinality
constraint in Inner may be replaced by the inequality $\sum_{(i,j) \in E} x_{ij} \ge n - p$.

A similar remark applies to S-Inner and to LP.

4. Some optimality properties 

In the following we shall call edge-variables the variables $x_{ij}$ such that $(i,j) \in E$
and non-edge-variables all other $x_{ij}$. We shall show that at the optimum of LP,
every non-edge-variable $x_{ij}$ can be expressed as a simple function of the
edge-variables $x_{hk}$ such that $(h,k)$ is an edge of the path $P_{ij}$. Moreover, we shall exhibit
a model equivalent to LP, involving only the $n-1$ edge-variables, the cardinality
constraint and having a convex objective function.

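Theorem 3 - Given an optimal solution $x$ to Inner or S-Inner or LP, the
following solution $\hat{x}$ is optimal to Inner or S-Inner or LP, respectively:

$\hat{x}_{ij} = \begin{cases} x_{ij}, & (i,j) \in E \\ \max\{0,\ \sum_{(h,k) \in P_{ij}} x_{hk} - l_{ij} + 1\}, & (i,j) \in A - E \end{cases}$   (4.1)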



Corollary 4 - If the dissimilarities are strictly positive, in any optimal solution to 
Inner or S-Inner or LP, the values of the edge-variables uniquely determine the 
values of the non-edge ones. 

Corollary 5 - If LP has an optimal solution where all edge-variables are binary,
then it has an optimal binary solution.

Remark 2 - When all edge-variables are binary, they define a set of cut edges,
and hence a feasible partition of T. In this partition, nodes $i$ and $j$ belong to the
same subtree iff $\hat{x}_{ij}$, as given by (4.1), is equal to 1. This observation provides a
direct, and more intuitive, way of computing the values of the
non-edge-variables, when the edge-variables are binary.

By Theorem 3, we can consider a new model which gives the same optimal
solutions as Inner and S-Inner:

$v^* = \min \sum_{(i,j) \in A} d_{ij} \max\{0,\ \sum_{(h,k) \in P_{ij}} x_{hk} - l_{ij} + 1\}$
s.t. [cardinality, unit-interval, integrality]   (Convex)

Similarly we can consider the continuous relaxation $\overline{\text{Convex}}$ of Convex, which
gives the same optimal solutions as LP. In the latter models only the
edge-variables appear and the objective function is convex.
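Formula (4.1) and the objective of Convex depend on the edge-variables only, which the following sketch makes explicit. The rooted representation is an assumption made for convenience: each edge is identified by its child endpoint, so `x_edge[v]` is the value of the edge between `v` and `parent[v]`. All names and the 6-node tree are hypothetical.

```python
from itertools import combinations

def path_children(parent, depth, i, j):
    """Child endpoints of the edges on the tree path between i and j."""
    out = []
    while i != j:
        if depth[i] < depth[j]:
            i, j = j, i                # always move the deeper endpoint up
        out.append(i)                  # edge (i, parent[i]) lies on the path
        i = parent[i]
    return out

def x_hat(parent, depth, x_edge, i, j):
    """Formula (4.1): max{0, sum of edge values on P_ij - l_ij + 1}."""
    path = path_children(parent, depth, i, j)
    return max(0.0, sum(x_edge[c] for c in path) - len(path) + 1)

def convex_objective(parent, depth, x_edge, d):
    """Objective of Convex, a function of the edge-variables only."""
    return sum(d[i][j] * x_hat(parent, depth, x_edge, i, j)
               for i, j in combinations(sorted(depth), 2))

# Hypothetical tree rooted at node 3: path 1-2-3-4 plus leaves 5 and 6 on 3.
parent = {3: None, 2: 3, 1: 2, 4: 3, 5: 3, 6: 3}
depth = {3: 0, 2: 1, 4: 1, 5: 1, 6: 1, 1: 2}
x_edge = {1: 1, 2: 0, 4: 1, 5: 0, 6: 1}     # edges (2,3) and (3,5) are cut
d = [[1] * 7 for _ in range(7)]              # unit dissimilarities, index 0 unused
print(convex_objective(parent, depth, x_edge, d))
```

With binary edge values the function `x_hat` reduces to the same-cluster indicator of Remark 2, so with unit dissimilarities the objective counts the 4 pairs lying in a common cluster. With fractional edge values it stays well defined, for example two edges at 0.5 on a path of length 2 give $\hat{x}_{ij} = 0$.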

5. A heuristic based on a lagrangian relaxation 

In order to find good solutions to the min inner dissimilarity tree clustering
problem, we consider the lagrangian relaxation of S-Inner, obtained by dropping
the cardinality and the transitivity constraints:

$v_{LR}(\lambda,\mu) = \min \sum_{(i,j) \in A} d_{ij} x_{ij} + \lambda \big( n - p - \sum_{(i,j) \in E} x_{ij} \big) + \sum_{1 \le i < j \le n} \mu_{ij} \big( x_{i\,h(i,j)} + x_{h(i,j)\,j} - x_{ij} - 1 \big)$
s.t. [s-order, unit-interval, integrality]   (Inner-LR)

where $\mu_{ij} \ge 0$ for $1 \le i < j \le n$, and we can consider $\lambda \ge 0$ because, by Remark 1,
the cardinality constraint may be replaced by the constraint
$\sum_{(i,j) \in E} x_{ij} \ge n - p$. This problem is equivalent to a maximal network flow
problem (Picard, 1976), hence its linear relaxation has an optimal integral
solution. Since Inner-LR has an optimal integral solution, it is equivalent to its
continuous relaxation and so it is a relaxation of LP. Hence
$v_{LR}(\lambda,\mu) \le v_{LP}$. Moreover, LP is a relaxation of S-Inner and then $v_{LP} \le v^*$. So
the following Proposition holds.

Proposition 6 - $v_{LR}(\lambda,\mu) \le v_{LP} \le v^*$

The integrality of the solution to the continuous relaxation of Inner-LR suggests
solving it for fixed values of $\lambda$ and $\mu$, and adjusting its optimal (integral)
solution if it does not satisfy the cardinality constraint. In order to obtain good
values for $\lambda$ and $\mu$, we solve LP and we consider the optimal values of the dual
variables associated with the cardinality and the transitivity constraints. For these
particular values of $\lambda$ and $\mu$, we have $v_{LP} = v_{LR}(\lambda,\mu)$ (Nemhauser and Wolsey,
1988). Finally, in order to satisfy the cardinality constraint, we choose an integer
k having a small value (1, 2 or 3) and we change the value of at most k
edge-variables at a time in an optimal way.

Inner-heuristic

1. Solve LP. If the optimal solution x' is integral then STOP: x' is optimal for Inner; else let
$\lambda$ and $\mu$ be the optimal values of the dual variables associated with the cardinality and
transitivity constraints, respectively.

2. Solve Inner-LR. Let x* be the optimal solution. If x* satisfies the cardinality and the
transitivity constraints then STOP: x* is optimal for Inner; else let $q = \sum_{(i,j) \in E} x^*_{ij}$.

3. If $q = n - p$ then STOP: the edge-variables induce a partition of T into p clusters.

4. (Cut adjustment procedure) If $|n-p-q| > 0$ then while $|n-p-q| > 0$
Let $k' := \min\{k, |n-p-q|\}$
Choose a set of $k'$ edge-variables $x^*_{ij}$ and change their values into $x^*_{ij} := 1 - x^*_{ij}$, so that
$|n-p-q|$ decreases by $k'$ units and the resulting inner dissimilarity is as small as possible.
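Step 4 can be sketched for the simplest setting k = 1, where one edge-variable is flipped per iteration. This is a hedged illustration rather than the authors' procedure: it works on the set of cut edges instead of the vector x*, uses the fact that $|n-p-q|$ equals the difference between the current number of cut edges and p-1, and all names and data are hypothetical.

```python
from itertools import combinations

def inner_dissimilarity(n, kept_edges, d):
    """ind(pi) for the partition whose clusters are the components of kept_edges."""
    label = list(range(n + 1))                # union-find on nodes 1..n
    def find(v):
        while label[v] != v:
            label[v] = label[label[v]]
            v = label[v]
        return v
    for i, j in kept_edges:
        label[find(i)] = find(j)
    return sum(d[i][j] for i, j in combinations(range(1, n + 1), 2)
               if find(i) == find(j))

def cut_adjustment(n, edges, d, cut, p):
    """Flip one edge at a time (k = 1) until exactly p-1 edges are cut,
    always choosing the flip that gives the smallest inner dissimilarity."""
    cut = set(cut)
    while len(cut) != p - 1:
        if len(cut) > p - 1:                  # too many clusters: restore an edge
            candidates = [cut - {e} for e in cut]
        else:                                 # too few clusters: cut one more edge
            candidates = [cut | {e} for e in edges if e not in cut]
        cut = min(candidates, key=lambda c: inner_dissimilarity(
            n, [e for e in edges if e not in c], d))
    return cut

# Hypothetical data: nodes of a path placed at positions 0, 1, 3, 5, 12 on a line.
pos = [None, 0, 1, 3, 5, 12]
d = [[0] * 6 for _ in range(6)]
for i in range(1, 6):
    for j in range(1, 6):
        d[i][j] = abs(pos[i] - pos[j])
path = [(1, 2), (2, 3), (3, 4), (4, 5)]
print(cut_adjustment(5, path, d, path, 3))    # start with every edge cut
```

Starting from five singleton clusters and asking for p = 3, the sketch first restores edge (1,2) and then edge (3,4), leaving the cut set {(2,3), (4,5)}. Larger values of k would examine all sets of k' edges per iteration instead of single edges.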



Step 1 requires the solution of a linear program in which all coefficients are 0, 1
or -1; hence its solution time is bounded above by a polynomial in n and p
(Tardos 1986). In Step 2 a maximal flow problem is solved and, also in this case,
the solution time is bounded by a polynomial in n and p (Ahuja, Magnanti and
Orlin, 1993). Finally the cut adjustment procedure given in Step 4 requires at
most $p/k$ iterations, each of which requires the computation of $ind(\pi)$ for at
most $\binom{n-1}{k}$ feasible partitions $\pi$. Since $k \le 3$, it follows that also the cut
adjustment procedure runs in polynomial time. Therefore, the running time of
the overall procedure Inner-heuristic is bounded by a polynomial in n and p.

The use of the polynomial algorithm of Tardos (1986) to solve LP is mainly of
theoretical value. More practical solution approaches are:

i. use any standard linear programming package;

ii. use a primal-dual (simplex or interior point) algorithm;

iii. solve directly the non-differentiable optimization problem Convex using,
e.g., one of the algorithms described in Chapter VII (by C. Lemarechal) of
Nemhauser et al., 1989.

Strategy ii. has the advantage that, if the algorithm stops before reaching an
optimal solution, good dual values $\lambda$ and $\mu_{ij}$ are still available. Strategy iii. is
especially suited for large problems, since the number of variables involved is
$O(n)$, rather than $O(n^2)$. Moreover, if LP is solved only approximately in Step
1, in order to obtain a better solution from Inner-LR, one could apply some
iterations of the subgradient method to the dual Lagrangian problem
(Nemhauser and Wolsey, 1988).

Remark 3 - If $v_H$ is the inner dissimilarity given by the solution of the
Inner-heuristic and if $v_{LP} > 0$, then one can obtain an upper bound of the
relative error by computing: $(v_H - v_{LP}) / v_{LP}$.

Remark 4 - The Inner-heuristic does not guarantee optimality but if it finds an 
optimal solution in Step 1 or in Step 2, then it recognizes it. 

Example 

Let us consider the following example with n = 10 and p = 5.

The following tree and dissimilarity matrix are given 




The optimal solution to LP is: JC19 =xhq= x^^ =xg iq = 1, 

^24 == ^36 ^38 = ^45 = -^47 ~ ^ Other variables being equal to 0. The 

optimal value is $v_{LP} = 78.5$. In Figure 1 we show with a continuous line the
optimal partition which gives $v^* = 79$, and by a dotted line the partition induced
by the edge-variables of the solution to Inner-LR, where the inner dissimilarity is
43. By applying the cut adjustment procedure in order to reduce the number of
cut edges, we obtain the optimal solution.
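As a worked illustration of the a posteriori bound, assume that the relative error of a heuristic value $v_H$ is bounded by the gap to the linear programming bound divided by that bound (this form is a reading of Remark 3, whose formula is damaged in the source). With the values of this example, $v_{LP} = 78.5$ and a heuristic value equal to the optimum, $v_H = v^* = 79$:

```latex
\[
\frac{v_H - v_{LP}}{v_{LP}} \;=\; \frac{79 - 78.5}{78.5} \;=\; \frac{0.5}{78.5} \;\approx\; 0.0064 ,
\]
```

so, under that assumption, the bound certifies that the partition found is within about 0.64% of the optimum, even without knowing that it is in fact optimal.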




In Lari et al. (1998), it is shown that, by considering a formulation similar to S-Inner, it
is also possible to formulate the min diameter problem on trees as an integer
linear program. Moreover, good solutions to the latter problem can be
obtained by applying a heuristic similar to Inner-heuristic. In this case the
solution of a linear program is embedded in a binary search on the possible
values of the diameter. Hence a sequence of linear programs, rather than a single
one, must be solved. However, these linear programs are usually smaller, since
many variables are forced to take the value zero.

Abstract: In the present article, we discuss some conditions on the existence of
asymmetric 2-principal points of univariate symmetric distributions and show 
some examples of 2-principal points that are asymmetric. 

k-Principal points of a distribution, proposed by Flury (1990), are k
points that minimize the expected squared distance from the nearest of the points
with respect to the distribution. The criterion of principal points is almost
the same as that of k-means clustering. k-means clustering is a method for a
data set, whereas the concept of principal points is a theory for theoretical
distributions.

Many people may guess that principal points of symmetric distributions are 
symmetric but Flury shows counter examples. We study conditions for 
symmetric or asymmetric principal points. A characteristic function p(c) for 2- 
principal points is defined in the paper. We investigate 2-principal points of 
mixed normal distributions with the characteristic function. 

Keywords: Cluster Analysis, k-means, Optimal Allocation.



1. Introduction 

Flury (1990) proposed principal points of distributions. The
principal points are a set of points in p-dimensional space that approximate a
given distribution. We first recall the definition of principal points.

Suppose /(x) is the density function of a continuous random variable X and 
F{x) is the distribution function. We assume that the distribution has finite 
second moments. 

At first, we define the distance between a point $x \in R^p$ and a set of points
$\{y_1, \ldots, y_k\}$, $y_j \in R^p$. The distance is defined by

$d(x \mid y_1, \ldots, y_k) = \min_{1 \le h \le k} \{ (x - y_h)^T (x - y_h) \}^{1/2}$   (1)

Definition (Principal Points)

$\xi_j \in R^p$ ($1 \le j \le k$) are called k-principal points of distribution F if

$E_F\{ d^2(X \mid \xi_1, \ldots, \xi_k) \} = \min_{y_j \in R^p} E_F\{ d^2(X \mid y_1, \ldots, y_k) \}$.

We put $P_F(k) = E_F\{ d^2(X \mid \xi_1, \ldots, \xi_k) \}$. In the case of p = 1 and k = 2, finding the
k-principal points amounts to finding two points that minimize the function

$M(y_1, y_2) = E( d^2(X \mid y_1, y_2) ) = \int \min\{ (x - y_1)^2, (x - y_2)^2 \} f(x)\,dx$.   (2)

The k-principal points $\xi_1, \ldots, \xi_k$ of a random variable X minimize the expected
squared distance of X from the nearest of the $\xi_j$. The two principal points of the
univariate normal distribution $N(\mu, \sigma^2)$ are $\mu \pm \sigma \sqrt{2/\pi}$ (Figure 1).
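The value $\sqrt{2/\pi}$ for the standard normal distribution can be checked numerically with a Lloyd-type fixed-point iteration, the continuous analogue of 2-means. This is only a sketch: the function names and the starting cut point are arbitrary choices, and the conditional means of a truncated standard normal are used in closed form.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_principal_points_normal(c=0.3, iterations=200):
    """Alternate between the conditional means on each side of the cut
    point c and the midpoint of those two means, for N(0, 1)."""
    for _ in range(iterations):
        y1 = -phi(c) / Phi(c)             # E[X | X < c]
        y2 = phi(c) / (1.0 - Phi(c))      # E[X | X > c]
        c = 0.5 * (y1 + y2)               # new boundary between the two points
    return y1, y2

y1, y2 = two_principal_points_normal()
print(y1, y2, math.sqrt(2.0 / math.pi))
```

The iteration settles on two points symmetric about 0 at distance $\sqrt{2/\pi} \approx 0.798$, in agreement with the formula above for $\mu = 0$, $\sigma = 1$.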

For the explanation of the concept of principal points, we give a concrete 
problem. When you would like to post a letter, you might go to the nearest letter 
box. What is the allocation of k letter boxes in a town relatively most 
convenient for citizens on the basis of the squared distances to the nearest 
letterbox (Figure 2)? 



Figure 1: Two principal points of the standard normal distribution

Figure 2: A problem of letter boxes allocation




2. Conditions on Symmetric 2-Principal Points 

Flury (1990) gave a theorem for symmetric 2-principal points:

Theorem 1 (Flury 1990)

Let X denote a continuous random variable with mean $\mu = E(X)$, symmetric
density $f(x)$, and finite second moments. Then the points

$y_1 = \mu - E(|X - \mu|)$,  $y_2 = \mu + E(|X - \mu|)$

provide a local minimum of $E[d^2(X \mid y_1, y_2)]$ if and only if $f(\mu) E(|X - \mu|) < 1/2$.

Strictly speaking, if $E[d^2(X \mid y_1, y_2)]$ is a local minimum, then
$f(\mu) E(|X - \mu|) \le 1/2$. If $f(\mu) E(|X - \mu|) < 1/2$, then $E[d^2(X \mid y_1, y_2)]$ is a local
minimum. We find the sufficient condition that $y_1$, $y_2$ provide a local minimum
of the function $M(y_1, y_2)$.

Theorem 2 (Mizuta)

Suppose the same conditions as Theorem 1 and $\mu = 0$. Points $y_1$, $y_2$ provide
a local minimum of $E[d^2(X \mid y_1, y_2)]$ if and only if

$p(c) = 0$ and $p'(c) < 0$,

where $c = (y_1 + y_2)/2$, $G(c) = 2\int_{-\infty}^{c} x f(x)\,dx + c\,(1 - 2F(c))$, $p(c) = G(c)\,G'(c) - c$.



It can be proved with this result that the 2-principal points of logistic 
distributions and those of Laplace distributions are symmetric. 



3. Two-Principal Points of Mixed Normal Distributions 

We apply the above condition to some symmetric univariate distributions
with the S language (Becker et al., 1988) to find asymmetric 2-principal
points. In particular, we investigate in depth the mixed normal distributions

$F(x) = (1 - \varepsilon) N(x; 0, 1) + \varepsilon N(x; 0, \sigma^2)$



The distributions are classified into three types (Figure 3). Mixed normal
distributions of Type 1 have unique symmetric 2-principal points. Mixed
normal distributions of Type 2 have essentially unique asymmetric 2-principal
points. Those of Type 3 have essentially two candidates of 2-principal
points: symmetric and asymmetric. Both candidates provide a local minimum.

Figure 3: Three types of 2-principal points of mixed normal distributions

Figure 4: The relation between the three types and parameters $\varepsilon$, $\sigma$

When $\varepsilon = 0.8$ and $\sigma = 0.2$, we can see from Figure 4 that the mixed normal
distribution belongs to Type 3. Figure 5 shows the graphs of
$p(c)$ and $M(y_1, y_2)$, and the range of c such that $p'(c) < 0$. There are three points
satisfying the conditions of Theorem 2: c = -2.636, 0.0, 2.636. The function
$M(y_1, y_2)$ has a local minimum when c = 0, but not a global minimum (Figure 6).
The 2-principal points of the mixed normal distribution are (0.07623, -1.13063)
and (-0.07623, 1.13063) (Figure 7).
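The reported points can be cross-checked with closed-form normal integrals. The sketch below assumes that $\sigma$ is the standard deviation of the second component and that $G(c) = 2\int_{-\infty}^{c} x f(x)\,dx + c(1 - 2F(c))$, so that $G'(c) = 1 - 2F(c)$; both are readings of the text, and all function names are hypothetical.

```python
import math

EPS, SIGMA = 0.8, 0.2                            # parameters of the example
COMPONENTS = [(1.0 - EPS, 1.0), (EPS, SIGMA)]    # (weight, standard deviation)

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def F(c):
    """Distribution function of the mixture."""
    return sum(w * Phi(c / s) for w, s in COMPONENTS)

def lower_partial_mean(c):
    """Integral of x f(x) over x < c for the mixture density f."""
    return sum(-w * s * phi(c / s) for w, s in COMPONENTS)

def conditional_means(c):
    a = lower_partial_mean(c)
    return a / F(c), -a / (1.0 - F(c))           # E[X | X < c], E[X | X > c]

def M_at(c):
    """Expected squared distance when both points are the conditional means."""
    y1, y2 = conditional_means(c)
    second_moment = sum(w * s * s for w, s in COMPONENTS)
    return second_moment - F(c) * y1 ** 2 - (1.0 - F(c)) * y2 ** 2

def p(c):
    """p(c) = G(c) G'(c) - c under the assumed form of G."""
    G = 2.0 * lower_partial_mean(c) + c * (1.0 - 2.0 * F(c))
    return G * (1.0 - 2.0 * F(c)) - c

c_asym = (0.07623 - 1.13063) / 2.0               # midpoint of the reported points
print(conditional_means(c_asym))                 # compare with (-1.13063, 0.07623)
print(M_at(0.0), M_at(c_asym))                   # symmetric vs asymmetric candidate
f0 = sum(w * phi(0.0) / s for w, s in COMPONENTS)
print(f0 * conditional_means(0.0)[1])            # Theorem 1 quantity f(0) E|X|
```

Under these assumptions the conditional means at the midpoint of the reported points reproduce those points, $p$ vanishes there and at c = 0, the asymmetric pair has the smaller expected squared distance, and $f(0) E|X|$ is just below 1/2, which agrees with the symmetric pair being a local but not a global minimum.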

Figure 5: The graphs of $p(c)$, $M(y_1, y_2)$ and the range of c such that $p'(c) < 0$

Figure 6: Local minimum but not principal points (c = 0)

Figure 7: Global minimum, i.e. principal points (c = 0.527)

4. Concluding Remarks 

In this article, we give sufficient conditions for 2-principal points and a family
of symmetric distributions whose 2-principal points are asymmetric.
For $k > 2$, asymmetric $k$-principal points also exist. For example, the 3-principal
points of the mixed normal distribution ($\varepsilon = 0.8$, $\sigma = 0.3$) are $y_1 = -0.835$,
$y_2 = 0.018$, $y_3 = 0.745$, which are asymmetric.

Tarpey (1994) and Li & Flury (1995) defined strong unimodality and proved
that if a univariate distribution is strongly unimodal, its $k$-principal points are
symmetric. With our Theorem 2, however, we find distributions that are not
strongly unimodal and yet have symmetric 2-principal points.

We deal with 2-principal points of univariate distributions in this paper, so
we can find principal points by solving the equation of Theorem 2. In
other cases, we must use iterative algorithms (like the $k$-means clustering
algorithm) and numerical integration (Shimizu et al. (1997)).

An open problem is how to estimate principal points from sample data. It is
natural to use the $k$-means clustering algorithm, but we have not evaluated it
here. We would like to take up this problem in the near future.

Abstract : The Clique Partitioning Problem constitutes a general framework for
clustering, when similarities assume positive and negative values.

The effectiveness of various heuristic methods of solving the above problem has 
been proved by several authors. 

In order to evaluate the validity of one optimal classification P* obtained by this 
approach, we propose a hierarchical process, working on the clusters of P*. Thus, 
we obtain a family of classifications, and the concept of pertinence enables us to 
detect the interesting classifications. 

Key words : Clique Partitioning, Simulated annealing, Tabu search, Pertinence. 



1. Introduction 



Cluster analysis sorts a set I of objects into "homogeneous" clusters. The usual 
approach [1,2, 3,4, 5] is to compute a symmetric measure of similarity Sij for each 
pair (i,j) of objects. 

In this paper, let us assume that the objects of $I$ are evaluated with respect to $p$
variables of the discrete type $V_1, V_2, \ldots, V_p$, and define [5] :

$$r_{ij} = \mathrm{Card}\,\{\, k \in \{1, \ldots, p\} : V_k(i) = V_k(j) \,\}$$

$$\bar{r}_{ij} = \mathrm{Card}\,\{\, k \in \{1, \ldots, p\} : V_k(i) \neq V_k(j) \,\}$$

then the quantity

$$s_{ij} = r_{ij} - \bar{r}_{ij}$$

is a similarity which takes values between $-p$ and $p$.

Let $P$ be one partition of $I$ into $q$ clusters, $P = \{G_1, \ldots, G_q\}$; then the following quantity

$$val(P) = \sum_{k=1}^{q} \sum_{\substack{i,j \in G_k \\ i<j}} s_{ij}$$

may be considered as a "good measure of quality" of $P$.

Then the classification problem may be formulated as follows :

($\pi$) Max [ $val(P)$ / $P$ partition of $I$ ]
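
The similarity and the criterion can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions: objects are encoded
as tuples of discrete values, a partition is a list of sets of object indices,
and the toy data and function names are hypothetical.

```python
from itertools import combinations


def similarity(u, v):
    """s_ij: number of variables on which u and v agree minus the number on which they differ."""
    agree = sum(1 for a, b in zip(u, v) if a == b)
    return agree - (len(u) - agree)


def val(partition, data):
    """Sum of s_ij over all pairs i < j that lie in the same cluster."""
    return sum(similarity(data[i], data[j])
               for cluster in partition
               for i, j in combinations(sorted(cluster), 2))


# Hypothetical toy data: 4 objects described by p = 3 discrete variables.
data = [("a", "x", 1), ("a", "x", 2), ("b", "y", 2), ("b", "y", 1)]

print(val([{0, 1}, {2, 3}], data))   # two clusters of similar objects
print(val([{0, 1, 2, 3}], data))     # a single cluster
```

On this toy data the two-cluster partition scores higher than the single
cluster, because the pairs it separates have negative similarity.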

This problem, known as the "clique partitioning problem", is NP-hard. Exact
algorithms based on branch-and-bound and polyhedral theory have been proposed
by several authors [3], but various heuristic approaches also lead to good results,
especially simulated annealing and tabu search [1].

In this paper we assume that problem ($\pi$) has been solved (exactly or not), and
$P^*$ will denote an optimal partition or a locally optimal partition obtained by
heuristic techniques.

We propose a tool for evaluating the validity of the classification $P^*$. This tool is
based on a generalization of the similarity concept.



2. Consistency of a cluster, and similarity between two clusters 

Definition.

Let $P = \{G_1, \ldots, G_q\}$ be a partition, we define :

1- for each $k \in \{1, \ldots, q\}$ :

$$\gamma(G_k) = \sum_{\substack{i,j \in G_k \\ i<j}} s_{ij} \quad \text{if } card(G_k) > 1$$

$$\gamma(G_k) = 0 \quad \text{if } card(G_k) = 1$$

( $\gamma(G_k)$ may be considered as a measure of consistency for the cluster $G_k$ )

2- for each pair $(k,l)$, $k,l \in \{1, \ldots, q\}$, $k \neq l$ :

$$\delta(G_k, G_l) = \sum_{i \in G_k} \sum_{j \in G_l} s_{ij}$$

( $\delta(G_k, G_l)$ may be considered as a measure of similarity between the clusters
$G_k$, $G_l$ ).

As an immediate consequence of this definition, the following formulas hold :

$$\delta(G_j \cup G_k, G_l) = \delta(G_j, G_l) + \delta(G_k, G_l) \qquad (1)$$

$$\gamma(G_k \cup G_l) = \gamma(G_k) + \gamma(G_l) + \delta(G_k, G_l) \qquad (2)$$

Then, with the partition $P = \{G_1, \ldots, G_q\}$, we may associate the symmetric $(q,q)$
matrix $S(P)$, defined as follows :

$$S_{jk}(P) = \delta(G_j, G_k), \quad j,k \in \{1, \ldots, q\},\ j \neq k; \qquad S_{kk}(P) = \gamma(G_k), \quad k \in \{1, \ldots, q\}$$

Let $P_n$ be the finest partition of $I$ :
it is obvious that

$$S_{jk}(P_n) = s_{jk} \quad (j \neq k)$$

Moreover, we define :

$$\gamma[P] = \sum_{k=1}^{q} \gamma(G_k)$$

$$\delta[P] = \sum_{j=1}^{q-1} \sum_{k=j+1}^{q} \delta(G_j, G_k)$$

$\gamma[P]$ measures the global consistency of the $q$ clusters of $P$, and $\delta[P]$ measures the
global similarity between clusters.

It may be noticed that

• $val(P) = \gamma[P]$

thus, the problem ($\pi$) consists of finding a partition which maximizes the global
consistency.

• $\gamma[P] + \delta[P] = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} s_{ij}$

Then, ($\pi$) is equivalent to the following problem :

Min [ $\delta[P]$ / $P$ partition of $I$ ]

• If $P_1$ denotes the partition of $I$ in a unique cluster, $P_1 = (\{1, 2, \ldots, n\})$, then we get :

$$val(P_1) = \gamma[P_1] = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} s_{ij}$$

So, it appears, in this framework, that an element of the partition $P = (G_1, \ldots, G_q)$,
for instance $G_1$, is pertinent if :

• $\gamma(G_1) > 0$ (the similarities prevail over the dissimilarities inside $G_1$)

• $\delta(G_1, G_k) < 0 \ \ \forall k = 2, \ldots, q$ (the dissimilarities prevail over the similarities
between $G_1$ and any other cluster)

Thus we propose the following definition :

Definition.

The partition $P = (G_1, \ldots, G_q)$ is pertinent if :

$$\gamma(G_k) \geq 0 \quad \forall k = 1, \ldots, q$$

$$\delta(G_j, G_k) \leq 0 \quad \forall j, k = 1, \ldots, q,\ j \neq k$$

The interest of this definition is first illustrated by the following result. 

Proposition 1. 

i) At least one pertinent partition exists for any problem ($\pi$).

ii) Any optimal solution of ($\pi$) is a pertinent partition.

Proof.

i) First we make the following two remarks.

- Each diagonal element of the matrix $S(P_n)$ is equal to zero.

- Let us assume that for a given partition $P$, each diagonal element of $S(P)$ is
positive or equal to zero, and consider the partition $P'$ deriving from $P$ by
grouping together two clusters of $P$, the similarity of which is positive (assuming
that such a pair of clusters exists). From (2) it follows that each diagonal element
of $S(P')$ is positive or equal to zero; moreover $\gamma[P'] \geq \gamma[P]$.

Thus the hierarchical process, working on the clusters of $P_n$, grouping together at
each step the two classes which have the greatest similarity, leads to a pertinent
partition.

ii) Let $P = (G_1, \ldots, G_q)$ be a non-pertinent partition. Then either there is a cluster $G_k$
such that $\gamma(G_k) < 0$, or there exists a pair of clusters $(G_j, G_k)$ such that $\delta(G_j, G_k) > 0$.

Considering the first case, $\gamma(G_k)$ may be written as follows :

$$\gamma(G_k) = \frac{1}{2} \sum_{i \in G_k} \sum_{\substack{j \in G_k \\ j \neq i}} s_{ij}$$

As $\gamma(G_k) < 0$, there necessarily exists $i \in G_k$ such that $\sum_{j \in G_k,\, j \neq i} s_{ij} < 0$.

We define the partition $P'$ deriving from $P$ by splitting $G_k$ in the two classes
$\{i\}$ and $G_k - \{i\}$, then we obtain :

$$\gamma[P'] = \gamma[P] - \sum_{\substack{j \in G_k \\ j \neq i}} s_{ij} > \gamma[P]$$

Considering the second case, we define the partition $P'$ deriving from $P$ by
grouping $G_j$ and $G_k$, and we obtain :

$$\gamma[P'] = \gamma[P] + \delta(G_j, G_k) > \gamma[P]$$

Thus for the two cases, $P$ is not an optimal solution for ($\pi$).

3. Construction of a set of pertinent partitions

Let us consider a partition $P^* = (G_1, \ldots, G_q)$ which is locally optimal for ($\pi$).

First we have to be precise about what a locally optimal partition is :

For any cluster $G_k$ of $P^*$ and for each element $i$ of $G_k$, we consider the partition $P'$
deriving from $P^*$ in one of the two following ways :

- either transferring $i$ from $G_k$ to $G_l$ ($l \neq k$),

- or defining a new cluster with $i$ as the unique element.

Then we obtain : $val(P') \leq val(P^*)$

(assuming that $\gamma(\emptyset) = 0$ in a logical way)

It is worth noticing that the heuristics solving ($\pi$) (especially simulated annealing
and tabu search) always give a locally optimal solution.

The following result gives the method of generating pertinent partitions. 

Proposition 2. 

Let us assume that the locally optimal partition $P^*$ is not pertinent. We consider
the hierarchical process working on the clusters of $P^*$, grouping together at each
step the two classes which have the greatest similarity. This process necessarily
leads to a pertinent partition.

Proof.

Let us suppose that there exists a cluster $G_k$ of $P^*$ such that $\gamma(G_k) < 0$. Then we
could find (see the proof of Proposition 1-ii) an element $i \in G_k$ such that
$\sum_{j \in G_k,\, j \neq i} s_{ij} < 0$, and the partition $P'$ deriving from $P^*$ by splitting $G_k$ in the
two classes $\{i\}$ and $G_k - \{i\}$ would verify :

$$\gamma[P'] > \gamma[P^*]$$

That is in contradiction to the local optimality of $P^*$, and then we get :

$$\gamma(G_k) \geq 0 \quad \forall k = 1, \ldots, q$$

As $P^*$ is not pertinent, there exists a pair of clusters $(G_j, G_k)$ such that $\delta(G_j, G_k) > 0$.
Then the arguments of the proof of Proposition 1-i allow us to show that the
hierarchical process necessarily leads to a pertinent partition.

The essential interest of this approach may not be reduced to that of obtaining a 
single pertinent partition. For clustering applications we do not have to stop the 
hierarchical process when obtaining the first pertinent partition. Continuing the 
process, other pertinent partitions will often be obtained. 

Thus the study of that set of pertinent partitions may be of great help for choosing
a "good" classification, taking into account not only the value of the criterion, but
also the number of clusters or any other particular aspect.

4. Examples

We have applied this method to several real-life problems.

To solve problem ($\pi$) we use a simulated annealing procedure as described in [1].
The computation time of the whole process is about 5 seconds (on a 200 MHz PC)
for a problem size equal to 50 (the number of elements to classify).

Example 1.

It concerns the classification of 36 different types of cetacea which are described
with respect to 15 characteristics. The detailed description and data are available
from Grotschel and Wakabayashi [3].

The partition $P^*$, obtained by the simulated annealing procedure, is the optimal
partition given in [3] ; we set $P^* = (G_1, G_2, G_3, G_4, G_5, G_6, G_7)$, using the same
numbering as in [3]. In Table 1, we present the similarity matrix of $P^*$.

Table 1 : Similarity matrix of P* (Example 1).

          G1     G2     G3     G4     G5     G6     G7
  G1      24
  G2      -8     48
  G3    -134   -202    102
  G4    -280   -594   -200    704
  G5     -32    -70    -44    -76      9
  G6    -111   -176   -159   -304    -18     70
  G7     -12    -46    -35   -158    -30    -66     10


In accordance with the optimality of $P^*$ and Proposition 1, it may be checked that
$P^*$ is pertinent (each diagonal element is positive and each non-diagonal element
is negative). In Table 2, we present the partitions obtained at each step by the
hierarchical process working on the clusters of $P^*$. Looking at these results, it
appears that $Q_1$, $Q_2$, $Q_3$, deriving from $P^*$, are pertinent partitions, having fewer
clusters than $P^*$ for a slight decrease of the criterion value.

Moreover the comparison of $val(P^*)$ with $val(P_1)$ (we denote by $P_1$ the partition with a
unique cluster) sheds an interesting light on the significance of the partition $P^*$.

Table 2 : Hierarchical process (Example 1).

  Partition                              Val      Pertinence
  P* = (G1,G2,G3,G4,G5,G6,G7)            967      Yes (optimal)
  Q1 = (G1∪G2,G3,G4,G5,G6,G7)            959      Yes
  Q2 = (G1∪G2,G3,G4,G5∪G6,G7)            941      Yes
  Q3 = (G1∪G2,G3∪G7,G4,G5∪G6)            906      Yes
  Q4 = (G1∪G2,G3∪G5∪G6∪G7,G4)            607      Yes
  Q5 = (G1∪G2,G3∪G4∪G5∪G6∪G7)           -131      No
  P1 = (G1∪G2∪G3∪G4∪G5∪G6∪G7)          -1780      No

Example 2.

It concerns the classification of 27 dogs which are described with respect to 7
characteristics. The detailed description and data are available from Saporta [6].
The partition $P^*$, obtained by the simulated annealing algorithm, is made up of 6
clusters ; in Table 3 we present its similarity matrix.



Table 3 : Similarity matrix of P* (Example 2).

          G1     G2     G3     G4     G5     G6
  G1      51
  G2       0     17
  G3     -36      1     38
  G4      -8    -13     -5      0
  G5    -208   -101   -153     -5     67
  G6     -36    -31   -129    -29    -31     26


Since $\delta(G_2, G_3) = 1$, $P^*$ is not pertinent: this is a case in which our simulated
annealing algorithm does not give an optimal solution.

In Table 4, we present the partitions obtained at each step by the hierarchical
process working on the clusters of $P^*$.

Table 4 : Hierarchical process (Example 2).

  Partition                           Val      Pertinence
  P* = (G1,G2,G3,G4,G5,G6)            199      No
  (G1,G2∪G3,G4,G5,G6)                 200      Yes (optimal)
  (G1,G2∪G3,G4∪G5,G6)                 195      Yes
  (G1∪G2∪G3,G4∪G5,G6)                 159      Yes
  (G1∪G2∪G3,G4∪G5∪G6)                  99      Yes
  P1 = (G1∪G2∪G3∪G4∪G5∪G6)           -585      No
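
The hierarchical process of Example 2 can be replayed from the similarity
matrix of $P^*$ alone, using formulas (1) and (2) to update the matrix after
each merge. The following Python sketch is an illustration only: the function
names are hypothetical, and ties between equally similar pairs are broken by
taking the first pair in index order, which is an assumption not stated in the
text.

```python
from itertools import combinations

# Lower triangle of the similarity matrix of P* (Example 2).
LOWER = [
    [51],
    [0, 17],
    [-36, 1, 38],
    [-8, -13, -5, 0],
    [-208, -101, -153, -5, 67],
    [-36, -31, -129, -29, -31, 26],
]
S3 = [[LOWER[max(j, k)][min(j, k)] for k in range(6)] for j in range(6)]
LABELS = ["G1", "G2", "G3", "G4", "G5", "G6"]


def is_pertinent(S):
    """Consistencies (diagonal) >= 0 and similarities (off-diagonal) <= 0."""
    m = len(S)
    return (all(S[k][k] >= 0 for k in range(m))
            and all(S[j][k] <= 0 for j, k in combinations(range(m), 2)))


def merge(S, labels, a, b):
    """Merge clusters a < b and return the updated matrix and labels."""
    keep = [k for k in range(len(S)) if k != b]
    new = [[S[j][k] for k in keep] for j in keep]
    for pos, k in enumerate(keep):
        if k == a:
            new[a][a] = S[a][a] + S[b][b] + S[a][b]          # formula (2)
        else:
            new[a][pos] = new[pos][a] = S[a][k] + S[b][k]    # formula (1)
    new_labels = [labels[k] for k in keep]
    new_labels[a] = labels[a] + "+" + labels[b]
    return new, new_labels


def hierarchical_process(S, labels):
    """Merge the most similar pair at each step; record (labels, val, pertinence)."""
    steps = [(list(labels), sum(S[k][k] for k in range(len(S))), is_pertinent(S))]
    while len(S) > 1:
        a, b = max(combinations(range(len(S)), 2), key=lambda p: S[p[0]][p[1]])
        S, labels = merge(S, labels, a, b)
        steps.append((list(labels), sum(S[k][k] for k in range(len(S))), is_pertinent(S)))
    return steps


for labels, value, pertinent in hierarchical_process(S3, LABELS):
    print(labels, value, "Yes" if pertinent else "No")
```

Only the cluster-level matrix is needed: the individual similarities $s_{ij}$
never have to be revisited once $S(P^*)$ is known.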



References 

[1] De Amorim S., Barthelemy J.P. & Ribeiro C. (1992). Clustering and Clique
Partitioning: Simulated Annealing and Tabu Search Approaches, in : Journal of
Classification, 9, 17-41.

[2] Bock H. (1994). Classification and Clustering : Problems for the Future, in :
E. Diday et al (eds), New Approaches in Classification and Data Analysis,
Springer Verlag, 3-24.

[3] Grotschel M. & Wakabayashi Y. (1989). A Cutting Plane Algorithm for a
Clustering Problem, in : Mathematical Programming, 45, 59-96.

[4] Hansen P., Jaumard B. & Sanlaville E. (1994). Partitioning Problems in
Cluster Analysis : a Review of Mathematical Programming Approaches, in :
E. Diday et al (eds), New Approaches in Classification and Data Analysis,
Springer Verlag, 228-240.

[5] Marcotorchino J.F. & Michaud P. (1981). Heuristic Approach to the Similarity
Aggregation Problem, in : Methods of Operations Research, 43, 395-404.

[6] Saporta G. (1990). Probabilites, Analyse des Donnees et Statistiques, Technip,
Paris.




Global stochastic optimization techniques 
applied to partitioning^ 

Javier Trejos Alex Murillo Eduardo Piza 
PIMAD-CIMPA, School of Mathematics 
University of Costa Rica 
2060 San Jose, Costa Rica 
(jtrejos,murillof,epiza)@cariari.ucr.ac.cr 

Abstract: We have applied three global stochastic optimization techniques to
the problem of partitioning: simulated annealing, genetic algorithms and tabu
search. The criterion to be minimized is the within-variance. Results obtained
are compared with those of classical algorithms and are shown to be better in
nearly all cases.

Keywords: genetic algorithms, simulated annealing, tabu search,
within-variance, numerical variables.

1 Partitioning 

Given a set $\Omega = \{x_1, x_2, \ldots, x_n\}$ with weights $w_i$, we look for a partition
$P = (C_1, \ldots, C_k)$ of $\Omega$ in $k$ classes, such that the within-variance

$$W(P) = \sum_{\ell=1}^{k} \sum_{x_i \in C_\ell} w_i\, d^2(x_i, g_\ell)$$

is a minimum, where $g_\ell$ is the centroid of $C_\ell$ and $d$ is the Euclidean distance
in $\mathbb{R}^p$. There exist many classical partitioning methods, such as the Forgy
$k$-means (1965), the MacQueen $k$-means (1967), Diday’s “nuées dynamiques”
(1971), etc. All of these methods depend on an initial partition that is
improved by a gradient-descent type algorithm, and they find a local minimum of
$W$. This characteristic of the classical methods led us to apply modern global
stochastic optimization techniques to this problem. The techniques applied
are simulated annealing, genetic algorithms and tabu search.

2 Global optimization for partitioning 

2.1 Simulated annealing 

Simulated annealing (SA) is a technique based on the Metropolis algorithm
in statistical physics. It was proposed by Kirkpatrick et al. (1983) and by
Cerny (1985), and a simplified version by Pincus (1968). Details on SA can be
found in Aarts & Korst (1990). In a combinatorial optimization problem, SA
improves the quality of a solution, but it has the special feature that it accepts
some worse solutions. The probability of accepting a worse solution depends
on a control parameter $c$ that plays the role of the temperature in physical
systems, and decreases as the system “gets cold”, that is, as $c$ tends to 0. Using
Markov chain models, it has been proved that SA finds, asymptotically, the
global optimum of the problem. Some conditions required for this convergence
are connectivity of the states (any two states are connected by a finite sequence
of states such that the probability of transition from one state to the following
state is strictly positive) and reversibility (if the system passes from state $a$
to state $b$ with probability $p$, then the system may come back from state $b$ to
state $a$ with the same probability $p$).

^Research supported by the University of Costa Rica, project number 114-97-332.

We have proposed (Piza & Trejos (1996)) a simulated annealing method based
on the following definition of a new partition $P'$ from partition $P$: (i) select
at random an object $x_i \in \Omega$; (ii) select at random an index $\ell \in \{1, 2, \ldots, k\}$;
and (iii) put $x_i$ in class $C_\ell$. This method is similar to one proposed in Klein
& Dubes (1990). With this definition for the generation of a new state (that
is, a partition), conditions of asymptotic convergence of SA are satisfied and
we apply the usual SA algorithm (see Aarts & Korst (1990)). A simplification
of the variation of the criterion ($\Delta W = W(P) - W(P')$) from one iteration to
the next one makes the method converge quite rapidly. Indeed, when object
$x$ moves from class $C_j$ in partition $P$ to class $C_\ell$ in partition $P'$, it is only
necessary to compute the difference

$$\Delta W = W(P) - W(P') = \frac{|C_j|}{n(|C_j| - 1)}\, \|g(C_j) - x\|^2 - \frac{|C_\ell|}{n(|C_\ell| + 1)}\, \|g(C_\ell) - x\|^2$$

where $|C_j|$ is the cardinal of class $C_j$ and we suppose that all the objects have
the same weight (a similar formula can be found in the case of general weights).
The centroids of the classes change in the following way:

$$g(C_j - \{x\}) = \frac{1}{|C_j| - 1}\,\bigl(|C_j|\, g(C_j) - x\bigr), \qquad g(C_\ell \cup \{x\}) = \frac{1}{|C_\ell| + 1}\,\bigl(|C_\ell|\, g(C_\ell) + x\bigr).$$
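
The saving brought by this simplification can be illustrated in Python. The
sketch below is not the authors' implementation: it assumes equal weights
$1/n$, represents classes as lists of coordinate lists, and uses hypothetical
function names. It computes the decrease $W(P) - W(P')$ of the criterion from
the two centroids only, and compares it with a full recomputation of $W$.

```python
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def within_variance(classes, n):
    """W(P) with equal weights 1/n; empty classes contribute nothing."""
    total = 0.0
    for c in classes:
        if c:
            g = centroid(c)
            total += sum(sq_dist(x, g) for x in c)
    return total / n


def delta_w(x, class_j, class_l, n):
    """W(P) - W(P') when x moves from class_j (size > 1) to a non-empty class_l.

    Only the two centroids are needed, not the whole partition."""
    nj, nl = len(class_j), len(class_l)
    gain = nj / (n * (nj - 1)) * sq_dist(centroid(class_j), x)
    loss = nl / (n * (nl + 1)) * sq_dist(centroid(class_l), x)
    return gain - loss


# Hypothetical data: 7 random points in the plane, split into two classes.
rng = random.Random(1)
points = [[rng.random(), rng.random()] for _ in range(7)]
class_j, class_l = points[:4], points[4:]
x = class_j[0]

before = within_variance([class_j, class_l], 7)
after = within_variance([class_j[1:], class_l + [x]], 7)
print(before - after, delta_w(x, class_j, class_l, 7))
```

In a real annealing loop the centroids would be kept up to date with the
update formulas rather than recomputed, so each proposed move costs a time
proportional to the dimension only.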

2.2 Genetic algorithm 

A genetic algorithm (GA) is based on some features of the natural evolution of
species. It handles a population of solutions of a combinatorial optimization
problem, and combines the solutions with the so-called genetic operators, in
order to avoid local minima. Classical genetic operators are selection, crossover
and mutation. GAs were proposed by Holland (1976) and details can be
found in Goldberg (1989). Modeling by Markov chains ensures the asymptotic
convergence to the global optima (Rudolph (1994)).

For partitioning, we use a population (a sample) of partitions, described
by “chromosomes” of length $n$ and “alleles” in an alphabet of $k$ letters. For
example, let $P = (2311332321)$, where $n = 10$, $k = 3$ and $x_1$ is in class 2,
$x_2$ in class 3, etc. The iterative method (Trejos (1996)) is such that in each
iteration, the following genetic operators can be used:

(i) Partitions are selected with a roulette-wheel procedure, such that the
probability that a partition $P$ will be selected is proportional to $W(P)$.

(ii) A mutation-type transfer: a single object is selected with a given
probability, and the index of the class where the object will be put is selected
at random.

(iii) A mutation-type exchange between two objects selected at random with
probability $p_e$.

(iv) A crossover, such that a “cut” point is selected at random, for two
partitions selected at random with probability $p_c$, and a combination of the
partitions is created as in the classical GA.

(v) A forced crossover, such that two partitions $P_1$ and $P_2$ are selected at
random with probability $p_f$, and a class selected at random on $P_1$ is copied
into $P_2$; this procedure generates a new “son” $S$ that will replace $P_2$.

The last of the genetic operators is better than classical crossover, since
it handles good partitioning information; indeed, objects that are together in
$P_1$ will be together in $S$, a useful condition because $P_1$ has survived some
time (that is, some iterations), thereby increasing its chances of being a good
partition.
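
The chromosome encoding and two of the operators can be sketched in Python.
This is a hedged illustration, not the authors' code: the function names are
hypothetical, and it assumes that the objects copied by the forced crossover
keep in $S$ the class label they carry in $P_1$, a detail the text does not
specify.

```python
import random


def mutation_transfer(p, k, rng):
    """Move one object chosen at random to a class chosen at random in 1..k."""
    s = list(p)
    s[rng.randrange(len(s))] = rng.randint(1, k)
    return s


def forced_crossover(p1, p2, rng):
    """Copy one class of p1, chosen at random, into p2 and return the son S.

    Assumption: the copied objects keep the label of their class in p1."""
    chosen = rng.choice(sorted(set(p1)))
    return [chosen if a == chosen else b for a, b in zip(p1, p2)]


rng = random.Random(0)
p1 = [2, 3, 1, 1, 3, 3, 2, 3, 2, 1]   # the chromosome (2311332321) of the text
p2 = [1, 1, 2, 2, 3, 3, 1, 2, 3, 3]   # a second, hypothetical partition
print(forced_crossover(p1, p2, rng))
print(mutation_transfer(p1, 3, rng))
```

Whatever class is drawn, the objects that share it in $P_1$ end up with a common
label in $S$, which is the property the forced crossover is designed to preserve.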

We have combined the above genetic operators with the classical Forgy
$k$-means method: after a number of iterations, each partition in the population
converges to a local optimum by the application of the $k$-means method. With
each solution obtained, the GA continues until a stop criterion is satisfied: the
maximum number of iterations is attained or the variance of the $W(P)$’s is
less than a given threshold.

2.3 Tabu search 

Tabu search (TS), proposed by Glover et al. (1993), is a combinatorial
optimization technique that handles a tabu list, which forbids the transition
to the states in this list. The technique is iterative and every state has a
neighborhood of states that can be reached from it. The new state is selected
as the best one in the neighborhood, unless it is in the tabu list.

In partitioning, we have applied TS (Murillo & Trejos (1996)) using a
partition as the state of the problem and neighborhoods defined by the transfer
of a single object to a new class. The tabu list is described by the class
indicators of the objects that moved in the last $m$ iterations; this list describes
a set of partitions that cannot be reached unless they are the best partitions,
in the sense of $W$, ever reached in the algorithm. The list must be long enough
to avoid cycles, but not too long. Parameters such as $m$ and the number of
iterations need to be adjusted. The simplification of $\Delta W$, mentioned in 2.1,
is also useful here, since it considerably reduces the execution time.

3 Numerical results

We have applied the three methods to several data sets. Table 1 shows the
values of $W$ for the French Scholar Notes^, Amiard’s Fishes^, Thomas’
Sociomatrix^ and Fisher’s Iris^. Results for Forgy $k$-means and Ward’s
hierarchical classification (the tree was cut at the specified number of classes)
are also presented for comparison.



Table 1: Within-variance W using tabu search (TS), simulated annealing (SA),
genetic algorithm (GA), Forgy k-means (kM) and Ward’s method; % indicates
the percentage of applications of the method that found the best solution
reported.

                       TS              SA              GA              kM          Ward
                     W       %       W       %       W       %       W       %       W
  French Notes
    2 classes      28.2     100    28.2     100    28.2     100    28.2      12    28.8
    3 classes      16.8     100    16.8     100    16.8      95    16.8      12    17.3
    4 classes      10.5     100    10.5     100    10.5      97    10.5       5    10.5

  Amiard’s fishes
    2 classes     69849      96   69368     100   69368      52   69849      49       -
    3 classes     32213     100   32213     100   32213      87   32213       8   33149
    4 classes     18281     100   18281     100   22456      90   18281       9   19589
    5 classes     14497      97   14497     100   20474      38   14497       1   14497

  Thomas’ sociom.
    3 classes     271.8     100   271.8     100   271.8      85   271.8       2   279.3
    4 classes     235.0     100   235.0     100   235.0      24   235.0    0.15   239.4
    5 classes     202.6      98   202.4     100   223.8       4   202.6    0.02   204.7

  Fisher’s Iris
    2 classes     0.999     100   0.999     100   0.999     100   0.999     100       -
    3 classes     0.521      76   0.521     100   0.521     100   0.521       4       -
    4 classes     0.378      60   0.378      55   0.378      82   0.378       1       -
    5 classes     0.312      32   0.329     100   0.312       6   0.312    0.24       -




^Schektman, Y. (1978) “Estadistica descriptiva: analisis lineal de datos
multidimensionales”, 1st part. Proc. Symp. Math. Meth. Appl. Sci., J. Badia et al.
(eds.), Universidad de Costa Rica, San Jose: 9-67.

^Cailliez, F.; Pages, J.P. (1976) Introduction a l’Analyse des Donnees. SMASH, Paris.

^Data can be found in Everitt, B.S. (1993) Cluster Analysis. 3rd ed. Edward Arnold,
London.




4 Concluding remarks 

Even though our methods depend on initial partitions, the optimization
techniques used usually avoid local optima and search for a global optimum.
The three methods proposed here require more time but yield better results
than those obtained with classical methods. We believe the improvement in
quality justifies the use of these new methods. It seems that results with
simulated annealing are better than the rest, since in nearly all cases we found the
global optimum in 100% of the experiments. However, simulated annealing
did not find the optimum for Fisher’s Iris with 5 classes, and with 4 classes
it found it only 55% of the time. Tabu search is also a good technique
for partitioning with medium-size data tables, and it is faster than simulated
annealing (in another paper we will report a comparison of the time of each
method). The genetic method is better than the $k$-means and the hierarchical
methods, but it is worse than simulated annealing and tabu search, and its
implementation is very slow.

A deeper analysis of the parameters of the three methods is needed: length
of the Markov chains, decrease of the control parameter, and initial and final
control parameters, in SA; stop criterion, size of the sample and probabilities for
the genetic operators, in GA; and length of the tabu list and maximum number
of iterations, in TS. A combination of the techniques may give some insights
into the handling of some of the parameters. For example, GA+SA can give a
stop criterion when the control parameter of SA is close to zero; TS+SA also
gives a stop criterion for TS. These ideas are under investigation at present.
Some generalizations of the methods have also been undertaken: the use of
distances other than the Euclidean one and the use of non-numeric data. At
the present time, we are studying the application of these global optimization
techniques to two-way and fuzzy partitioning, as well as to Multidimensional
Scaling. Another area of current research is the application of a distributed
search algorithm (Courrieu (1993)) to the partitioning problem.

Abstract: A new approach to cluster analysis has been introduced based on
parsimonious geometric modelling of the within-group covariance matrices
in a mixture of multivariate normal distributions, using Bayesian calculation
and the Gibbs sampler. The approach addresses many limitations of the
maximum likelihood approach. Here we propose to investigate more general
models dealing with the shape and orientation of the clusters using the
Gibbs sampler.

Keywords: Bayes Factor, Gaussian Mixture, Gibbs Sampler. 



1. Introduction 

Cluster analysis has been developed mainly through the invention and
empirical, and lately Bayesian, investigation of ad hoc methods, in isolation
from more formal statistical procedures. In recent years it has been found
that basing cluster analysis on a probability model can be useful both for
understanding when existing methods are likely to be successful and for
suggesting new methods (Symons 1981; McLachlan and Basford 1988; Banfield
and Raftery 1993). One such probability model is that the population of
interest consists of $K$ different subpopulations, and that the density of a
$p$-dimensional observation $x$ from the $k$th subpopulation is $f_k(x, \theta)$ for some
unknown vector of parameters $\theta$. Given observations $x = (x_1, \ldots, x_n)$,
we let $\nu = (\nu_1, \ldots, \nu_n)^T$ denote the identifying labels, where $\nu_i = k$ if $x_i$
comes from the $k$th subpopulation. In the so-called classification maximum
likelihood procedure, $\theta$ and $\nu$ are chosen to maximize the likelihood

$$p(x, \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f_k(x_i, \theta) \qquad (1)$$

In Bayesian classification, we assume that the data to be classified $x_i$ ($i =
1, \ldots, n$; $x_i \in \mathbb{R}^p$) arise from a random vector with density $p(x, \theta)$ as in (1)
and that the corresponding classification variables $\nu_i$ are unobserved. We
are concerned with Bayesian inference about the model parameters $\theta$, $\pi$ and
the classification indicators $\nu_i$.

Bensmail et al (1997) proposed a Bayesian approach to overcome the
difficulties which arose in Banfield and Raftery’s work (1993), hereafter BR.
In particular, with the BR algorithm, the user must specify the shape
matrix $A$ when using the model $[D_k A D_k^T]$. Dasgupta and Raftery (1995) used
the same model to detect features in spatial point processes where the shape
matrix was unknown, but they restricted the diagonal terms of the shape
matrix to be equal, $A = diag\{1, \alpha, \ldots, \alpha\}$, and to have a low value ($\alpha$ is
the shape parameter). In our approach, the shape matrix is completely
unknown and is estimated using a fully Bayesian calculation.

In Bensmail et al (1997), four parsimonious models were explicitly
considered to classify the data. These are the spherical models $[\lambda I]$, $[\lambda_k I]$,
the linear model $[\Sigma]$ and the proportional model $[\lambda_k \Sigma]$. We propose to
extend this work to the family of models where the covariance matrix is
represented by $[\lambda D_k A D_k^T]$ and $[\lambda_k D_k A D_k^T]$, using a fully Bayesian inference
via Gibbs sampling. This overcomes all the limitations mentioned. We
use the Laplace-Metropolis approximation to calculate the Bayes factor (Lewis
and Raftery 1997); this approximation is used to choose the model and to
determine the number of groups simultaneously. In section 2, we outline the
Bayesian calculation of the models, and how the Bayes factor is approximated
from the MCMC output. In section 3 we show the methods at work on
diabetes data (Reaven and Miller 1979).



2. Model and Estimation 

We assume that the data $x_i$ ($i = 1, \ldots, n$; $x_i \in \mathbb{R}^p$) to be classified arise from a
random vector with density

$$p(x, \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f_k(x_i \mid \mu_k, \Sigma_k)$$

where $f_k(\cdot \mid \mu_k, \Sigma_k)$ is the multivariate normal density function with mean $\mu_k$ and
covariance matrix $\Sigma_k$, and $\pi = (\pi_1, \ldots, \pi_K)$ is the vector of mixing proportions
($\pi_k > 0$, $\sum_k \pi_k = 1$). The approach is based on a variant of the standard spectral
decomposition of $\Sigma_k$, namely

$$\Sigma_k = \lambda_k D_k A_k D_k^T,$$

where $\lambda_k$ is a scalar, $A_k = diag\{1, a_{k2}, \ldots, a_{kp}\}$ with $1 > a_{k2} > \cdots > a_{kp} > 0$,
and $D_k$ is an orthogonal matrix. We are concerned with Bayesian
inference about the model parameters $\theta$, $\pi$ and the classification indicators $\nu_i$.

MCMC methods provide an efficient and general recipe for Bayesian
analysis of mixtures. For instance, many authors have used the Gibbs sampler
or the Data Augmentation method of Tanner and Wong (1987) for
estimating the parameters in univariate and multivariate Gaussian mixtures and
proved that both algorithms converge to the true posterior distribution of
the mixture parameters.

The models we are investigating are: $[D_k A D_k^T]$, $[\lambda_k D_k A D_k^T]$ and $[\Sigma_k]$. We
use conjugate priors for the parameters $\pi$ and $\theta$ of the mixture model. The
prior distribution of the mixing proportions is a Dirichlet distribution

$$(\pi_1, \ldots, \pi_K) \sim D(\alpha_1, \ldots, \alpha_K)$$

and the prior distributions of the means $\mu_k$ of the mixture components
conditionally on the variance matrices $\Sigma_k$ are Gaussian:

$$\mu_k \mid \Sigma_k \sim N_p(\xi_k, \Sigma_k / \tau_k)$$

The prior distributions of the shape matrix terms $a_1, \ldots, a_p$ and the volumes
$\lambda_1, \ldots, \lambda_K$ are inverse gamma:

$$a_t \sim IG(r_t/2, \rho_t/2) \quad (t = 1, \ldots, p)$$

$$\lambda_k \sim IG(l_k/2, w_k/2)$$

We estimate the models by simulating from the joint posterior distribution
of $\pi$, $\theta$ and $\nu$ using the Gibbs sampler and following the same
steps as in Bensmail et al (1997), so the Gibbs sampler steps go as follows:

1. Simulate the classification variables $\nu_i$ according to their posterior
probabilities $(t_{ik}, k = 1, \ldots, K)$ conditional on $\pi$ and $\theta$,

\[ t_{ik} = \frac{\pi_k f_k(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j f_j(x_i \mid \mu_j, \Sigma_j)}, \quad i = 1, \ldots, n. \]

2. Simulate the vector $\pi$ of mixing proportions according to its posterior
distribution conditional on the $\nu_i$'s.

3. Simulate the parameters $\theta$ of the model according to their posterior
distributions conditional on the $\nu_i$'s.
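Steps 1 and 2 of the sampler above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, NumPy is assumed, and step 3 is omitted because its form depends on which covariance model is chosen. Step 2 uses the standard conjugacy of the Dirichlet prior, whose posterior parameters are the prior parameters plus the cluster counts.

```python
import numpy as np

def log_gaussian_density(x, mean, cov):
    """Log density of N_p(mean, cov) evaluated at each row of x."""
    p = x.shape[1]
    diff = x - mean
    chol = np.linalg.cholesky(cov)
    # Solve chol @ z = diff^T so that the quadratic form equals sum(z**2).
    z = np.linalg.solve(chol, diff.T)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (p * np.log(2.0 * np.pi) + log_det + np.sum(z**2, axis=0))

def sample_labels(x, pi, means, covs, rng):
    """Step 1: draw each label from its posterior probabilities t_ik."""
    log_t = np.column_stack([
        np.log(pi[k]) + log_gaussian_density(x, means[k], covs[k])
        for k in range(len(pi))
    ])
    log_t -= log_t.max(axis=1, keepdims=True)  # stabilise before exponentiating
    t = np.exp(log_t)
    t /= t.sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(len(pi), p=row) for row in t])
    return labels, t

def sample_proportions(labels, alpha, rng):
    """Step 2: Dirichlet posterior with parameters alpha_k + n_k."""
    counts = np.bincount(labels, minlength=len(alpha))
    return rng.dirichlet(alpha + counts)
```

In a full sampler these two functions would be called in turn with the current parameter values, followed by a model-specific draw of the means, volumes, shapes and orientations.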

For choosing both a model $M_2$ against a model $M_1$ and the number of
groups in one step, we compute the approximate Bayes factor from the
Gibbs sampler output using the Laplace-Metropolis estimator (Bensmail et
al (1997)) given by the formula:

\[ B_{21} = \frac{p(x \mid M_2)}{p(x \mid M_1)} \approx \frac{|\Psi^{(2)}|^{1/2}\, p(x \mid \hat\theta^{(2)})\, p(\hat\theta^{(2)})}{|\Psi^{(1)}|^{1/2}\, p(x \mid \hat\theta^{(1)})\, p(\hat\theta^{(1)})}, \]

where $\hat\theta^{(i)}$ ($i = 1, 2$) is the posterior mode of $\theta^{(i)}$
($\theta^{(i)}$ is the parameter of the model $M_i$), and $\Psi^{(i)}$ is minus
the inverse Hessian of $h(\theta) = \log\{p(x \mid \theta) p(\theta)\}$,
evaluated at $\theta = \hat\theta^{(i)}$.

The Laplace method requires the posterior mode, $\hat\theta$, and $|\Psi|$.
The Laplace-Metropolis estimator estimates these from the Gibbs sampler
output. The likelihood at the approximate posterior mode is

\[ p(x \mid \hat\theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \hat\pi_k f_k(x_i \mid \hat\mu_k, \hat\Sigma_k). \]

These quantities are then substituted into the previous equation to obtain
the Bayes factor.
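The mixture likelihood above is a product of many small terms, so in practice it is usually evaluated on the log scale. The following is a minimal sketch of that computation, assuming NumPy; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def log_gaussian_density(x, mean, cov):
    """Log density of N_p(mean, cov) evaluated at each row of x."""
    p = x.shape[1]
    diff = x - mean
    _, log_det = np.linalg.slogdet(cov)
    quad = np.sum(diff.T * np.linalg.solve(cov, diff.T), axis=0)
    return -0.5 * (p * np.log(2.0 * np.pi) + log_det + quad)

def mixture_log_likelihood(x, pi, means, covs):
    """log p(x | theta) = sum_i log sum_k pi_k f_k(x_i | mu_k, Sigma_k)."""
    log_terms = np.column_stack([
        np.log(pi[k]) + log_gaussian_density(x, means[k], covs[k])
        for k in range(len(pi))
    ])
    # logaddexp.reduce computes log(sum(exp(.))) over components stably.
    return float(np.sum(np.logaddexp.reduce(log_terms, axis=1)))
```

Evaluating this function at the approximate posterior mode of each model gives the likelihood factors that enter the Bayes factor.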



3. Diabetes example

Reaven and Miller (1979) described and analyzed data for 145 subjects,
consisting of the area under a plasma glucose curve (glucose area), the area
under a plasma insulin curve (insulin area) and steady-state plasma glucose
response (SSPG). The subjects were clinically classified into three groups,
chemical diabetes (Type 1), overt diabetes (Type 2), and normal
(nondiabetic). Symons (1981) reanalyzed the data using seven different
clustering criteria. The data have the three-dimensional shape of a
boomerang with two wings and a fat middle, as we can see in Figure 1. One of
the wings corresponds to patients with overt diabetes, the other wing is
composed primarily of patients with chemical diabetes, and the "fat middle"
is composed of normal patients. Banfield and Raftery (1993) evaluated the
criterion based on the model $[\lambda_k D_k A D_k^T]$, where
$A = \mathrm{diag}\{1, a, a\}$, which means that the clusters have different
sizes and orientations but have the same "tubular" shape. The estimated
values of $a$ are 0.09, 0.19 and 0.34, and the result reported by the BR
algorithm is $a = 0.2$. The model $[\lambda_k D_k A D_k^T]$ with three
groups is favored quite strongly over the alternatives. The posterior means
are $\lambda_1 A = \mathrm{diag}\{0.09, 0.09, 0.43\}$,
$\lambda_2 A = \mathrm{diag}\{0.30, 0.36, 0.47\}$ and
$\lambda_3 A = \mathrm{diag}\{0.13, 0.16, 0.35\}$.

The optimal classification misclassified only 9% of the points, most of
which are situated at the border of the clusters. The BR algorithm gave a
ten percent error rate, but its misclassified points were dispersed over
the three clusters. The present results compare our approach favorably to
the procedures of Symons (1981) and BR (1993).
Figure 1: Three two-dimensional projections of the three-dimensional
diabetes data. The symbols indicate the clinical classification of subjects
as having chemical diabetes, overt diabetes, or being normal. The lower
right panel shows the three clusters in the diabetes data found by our
approach using the model $[\lambda_k D_k A D_k^T]$.
4. Discussion

We have presented a fully Bayesian analysis of a model-based clustering
approach based on a mixture model in which the features of interest are
represented by multivariate normal densities, specified by the shape being
the same across clusters, while the features and orientations are different.
An alternative frequentist approach consists of maximizing the mixture
likelihood using the EM algorithm. Dasgupta and Raftery (1995) considered
the same model in which the features of interest are represented by
multivariate normal densities with high linearity, specified by the shape
being the same across features, which makes the model restrictive. In our
case, we considered the case where shape, orientation and volume are
totally unknown and no restriction is made on the shape parameter. Our
approach is therefore more general, overcomes the limitations of Dasgupta
and Raftery (1995), and is an extension of the work of Bensmail et al
(1997).
Abstract: The resubstitution error estimate for the partitioning
classification rule from a sample $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$
is shown to be asymptotically normal under the condition that $X$ has a
density $f$, if the partition consists of rectangles.
1. Introduction

Let $X$ be the $d$-dimensional feature vector with distribution $\mu$, and
let $Y$ be the binary valued label. Denote the a posteriori probabilities by

\[ P_i(x) = P\{Y = i \mid X = x\}, \quad i = 0, 1, \]

and the Bayes error probability by

\[ L^* = E\{\min(P_0(X), P_1(X))\}. \]

For the pattern recognition problem let $(X_1, Y_1), (X_2, Y_2), \ldots$ be
i.i.d. random variables such that $(X_i, Y_i)$ have the same distribution as
$(X, Y)$. Let $\mathcal{P}_n = \{A_{nj}, j = 1, 2, \ldots\}$ be a partition of
$\mathbb{R}^d$, and let $A_n(x)$ denote the subset in the partition that
includes $x$. The resubstitution estimate $L_n$ counts the proportion of
errors committed on the training sequence
$(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ by the classifier, i.e. for a
classifier $g_n$ it is defined as

\[ L_n = \frac{1}{n} \sum_{i=1}^{n} I_{\{g_n(X_i) \ne Y_i\}}, \]

which for the partitioning classification rule

\[ g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^{n} I_{\{Y_i = 1\}} I_{\{X_i \in A_n(x)\}} \le \sum_{i=1}^{n} I_{\{Y_i = 0\}} I_{\{X_i \in A_n(x)\}} \\ 1 & \text{otherwise} \end{cases} \]

1 The research was supported by the Computer and Automation Institute of the
Hungarian Academy of Sciences (MTA SZTAKI).
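can be written in the following form:

\[ L_n = \sum_{j=1}^{\infty} \min\{\nu_n(A_{nj}),\ \mu_n(A_{nj}) - \nu_n(A_{nj})\}, \qquad (1) \]

where

\[ \nu_n(A) = \frac{1}{n} \sum_{i=1}^{n} I_{\{X_i \in A,\ Y_i = 1\}} \]

and

\[ \mu_n(A) = \frac{1}{n} \sum_{i=1}^{n} I_{\{X_i \in A\}}. \]

$L_n$ is an estimate of the Bayes error $R_n^*$ restricted to the partition
$\mathcal{P}_n$:

\[ R_n^* = \sum_{j=1}^{\infty} \min\{\nu(A_{nj}),\ \mu(A_{nj}) - \nu(A_{nj})\}, \]

where

$\nu(A) = E\nu_n(A)$.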
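The equality between the error count of the partitioning rule and the cell-wise sum in (1) can be checked numerically. The sketch below assumes a cubic partition with a hypothetical side length `h` and uses illustrative function names; it computes the estimate both ways, with ties in a cell resolved in favour of label 0.

```python
import numpy as np
from collections import defaultdict

def cell_index(x, h):
    """Index of the cube of side h that contains each row of x."""
    return [tuple(idx) for idx in np.floor(x / h).astype(int)]

def resubstitution_error(x, y, h):
    """Return L_n computed by formula (1) and by direct error counting."""
    n = len(y)
    ones = defaultdict(int)   # n * nu_n(A): points in cell A with label 1
    total = defaultdict(int)  # n * mu_n(A): all points in cell A
    cells = cell_index(x, h)
    for c, label in zip(cells, y):
        total[c] += 1
        ones[c] += int(label)
    # Formula (1): sum over cells of min{nu_n, mu_n - nu_n}.
    by_formula = sum(min(ones[c], total[c] - ones[c]) for c in total) / n
    # Direct count: majority vote in each cell, then count training errors.
    predict = {c: 0 if ones[c] <= total[c] - ones[c] else 1 for c in total}
    by_count = sum(int(predict[c] != label) for c, label in zip(cells, y)) / n
    return by_formula, by_count
```

Both returned values agree for any sample, since in each cell the majority vote misclassifies exactly the minority class.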

2. Asymptotic Normality 

Concerning the resubstitution error estimate for the partitioning rule the
following inequalities are known (see Sec. 23.2 in Devroye, Györfi and
Lugosi (1996)): for an arbitrary partition

\[ \mathrm{Var}(L_n) \le \frac{1}{n} \]

and

\[ E L_n \le R_n^*. \]



For finite partition of size rrin 

K - ELn < 



2rrin 



Our aim is to derive the asymptotic normality of the distribution of $L_n$.

Theorem 1 Consider the partitions $\mathcal{P}_n$ where $A_{nj}$ are
$d$-dimensional rectangles. Let $a_n^i(x)$, $i = 1, 2, \ldots, d$, denote the
sidelengths of $A_n(x)$. Assume that for all $x$ there exists a $K(x)$ so
that $a_n^i(x)/a_n^j(x) \le K(x)$ for all $1 \le i, j \le d$ and $n$.
Assume that for each sphere $S$ centered at the origin

\[ \lim_{n \to \infty} \sup_{j:\, A_{nj} \cap S \ne \emptyset} \mathrm{diam}(A_{nj}) = 0 \qquad (2) \]
and

\[ \lim_{n \to \infty} \frac{|\{j : A_{nj} \cap S \ne \emptyset\}|}{n} = 0 \]

and

\[ \lim_{n \to \infty} \frac{\log n}{n \lambda(A_n(x))} = 0, \]

where $\lambda$ is the Lebesgue measure. If $\mu$ has a density $f$ and

\[ P\{D(X) = 0\} = 0, \]

where $D(x) = P_0(x) - P_1(x)$, then there is a sequence $C_n$ such that

ni/2 (L„-C„)/yZ^4Ar(0,l). 

Theorem 2 If, in addition to the conditions of Theorem 1, for

\[ \tau(A) = \mu(A) - 2\nu(A) = \int_A D(x)\, \mu(dx) \]

there is a constant c such that 
for all n and j then 

(L„-i?:)/yZ^4iV(0,l). 



( 3 ) 

( 4 ) 



(5) 

( 6 ) 



Remark.

For cubic partitions with size $h_n$ these conditions mean $h_n \to 0$,
$n h_n^d \to \infty$ and $n h_n^d / \log n \to \infty$ as $n \to \infty$.
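As a worked illustration of the three conditions in the Remark (the particular rate below is our choice for the example, not one taken from the paper), consider the cube side $h_n = n^{-1/(2d)}$:

```latex
\begin{align*}
h_n &= n^{-1/(2d)} \to 0, \\
n h_n^d &= n \cdot n^{-1/2} = n^{1/2} \to \infty, \\
\frac{n h_n^d}{\log n} &= \frac{n^{1/2}}{\log n} \to \infty
\qquad \text{as } n \to \infty .
\end{align*}
```

More generally, any $h_n = n^{-c/d}$ with $0 < c < 1$ satisfies all three conditions, since then $n h_n^d = n^{1-c}$ grows faster than $\log n$.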

3. Sketch of the proof of Theorem 1 

The main difficulty in proving the theorem is that it states asymptotic
normality of infinite sums of dependent random variables. Thus, we need a
technique to extend asymptotic normality of the partial sums to the infinite
sum. Assume that the $\{A_{nj}\}$ are ordered according to non-decreasing
distances of their centers from the origin. For $\alpha \in (0, 1)$ choose
the integer $m_n$ such that

\[ \sum_{j=1}^{m_n} \mu(A_{nj}) \le 1 - \alpha < \sum_{j=1}^{m_n + 1} \mu(A_{nj}). \]

Roughly speaking, $S_{\alpha_n} = \bigcup_{j=1}^{m_n} A_{nj}$ is approximately
a sphere centered at the origin with $1 - \alpha_n = \mu(S_{\alpha_n})$.
Because of (2) and the absolute continuity of $\mu$

\[ 0 \le \alpha_n - \alpha \le \mu(A_{n, m_n + 1}) \le \max_j \mu(A_{nj}) \to 0 \]

as $n \to \infty$. Also define $S_\alpha$ as the sphere centered at the origin
with the property $1 - \alpha = \mu(S_\alpha)$.
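The basis of our proof is Poissonization. Introduce the notation $N_n$ for
a Poisson($n$) random variable independent of
$(X_1, Y_1), (X_2, Y_2), \ldots$; moreover put

\[ n \tilde\nu_n(A) = \sum_{i=1}^{N_n} I_{\{X_i \in A,\ Y_i = 1\}} \]

and

\[ n \tilde\mu_n(A) = \sum_{i=1}^{N_n} I_{\{X_i \in A\}}. \]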

Proposition 1 (Beirlant, Mason (1995)) Let Qnj be real measurable func- 
tions and let Mn,a = 9nj (i^n(Anj), An(Anj)). Assume that 



u) = E exp ( itMn,a + iv 



N„. 



n 



-> e 



- (<^Pa +2<V7a -\-V^)/2 



with pI^ = h[x)dx and 7 ^ = Jg^g{x)dx, where h{x) and g[x) are some 
measurable functions such that 70 = J^d g{x)dx < pi = h{x)dx < 00 . 
Then 

00 

(^n(Anj), j ypQ ■” 7 o ^ A( 0 , 1 ). 
j=l 



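Proof of Theorem 1. Because of $\min\{a, b\} = \frac{1}{2}(a + b - |a - b|)$
and $\sum_j \mu_n(A_{nj}) = 1$,

\[ L_n = \frac{1}{2}(1 - J_n), \qquad (7) \]

where $J_n = \sum_{j=1}^{\infty} |\tau_n(A_{nj})|$ with
$\tau_n(A) = \mu_n(A) - 2\nu_n(A)$, therefore the first
statement of Theorem 1 is equivalent to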

{Jn - (1 - 2Cn )) 4 A(0, 1). (8) 

In order to prove ( 8 ) let 



where $\tilde\tau_n(A) = \tilde\mu_n(A) - 2\tilde\nu_n(A)$. So we need to prove that



n 



1/2 



£(|r„K,)|-E|f.(A.,)|)| 

J 



/v^4tv(0,i), 



for which we will use the Proposition. We choose the functions $g_{nj}$ as
9nj{'^i'^) 2u\ ^ \fln{Anj) 2l)n{Anj)\) (j = 1 , . . . ^TTln) 
and let $h(x) = f(x)$ and $g(x) = |D(x)| f(x)$, so

\[ \rho_0^2 = 1, \qquad \gamma_0 = \int |D(x)| f(x)\, dx = 1 - 2L^*. \]



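Now, we check the conditions of the Proposition. Introduce

\[ S_n = t \sqrt{n} \sum_{j=1}^{m_n} \left( |\tilde\tau_n(A_{nj})| - E|\tilde\tau_n(A_{nj})| \right) + v\, \frac{N_n - n}{\sqrt{n}}, \]

for which a central limit result is to be shown to hold. Remark that

\[ \mathrm{Var}(S_n) = n \sum_{j=1}^{m_n} \left[ t^2\, \mathrm{Var}(|\tilde\tau_n(A_{nj})|) + 2tv\, E\left( |\tilde\tau_n(A_{nj})| (\tilde\mu_n(A_{nj}) - \mu(A_{nj})) \right) \right] + v^2. \]

Apply Lemma 1 for $Z_i = I_{\{X_i \in A\}}(1 - 2Y_i)$ and $p = 2$; then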

\nVar{UA)) - f,{A)\ < + M)) < (9) 



therefore, on the one hand we get that 



l/2f 



limsupny; Var(|f„(^„j)l) < p\. 
" j=i 



Obviously 

Tfl-n 

liin inf n'^V ar{\fn[Anj)\) 



j = l 



> liminfn^f^ar(f„(yl„j)) - limsupn^((E|f„(A„j)|f - (Ef„(A„j))^) > pi, 

j=i " j=i 



if 



limsupn5])((E|f„(yl„j)|)2 - r(A„j)^) = 0. 

” j=i 

It can be easily seen that by Lemma 2 for a set A 
n((E|f„(^)|)2 - T{Af) < 

Because of (11) 

1 9 

n((E|f„(^)l)2 - T{Af) < -Var{nfn{A)) < p{A) + -j^p{Ay^^. 
From these by the inequality 

min(a 4- 6, c) < min(a, c) + min(6, c) < min(a, c) + 
we get that

mn 

nY,mfn{Anj)\f-r{A^^f) 
j = l 

mn 

j = l 

rUn 
J = 1 



if P{D{X) = 0) = 0. To see this limit introduce the functions fn{x) = 
and D then 

X{An(x)) J^n[X) — ^(^An{x))^ 



TTln 

^ min {%n^i{Anjfe~'^AA„A'^l{MAnA)^^(^Anj)) 
i = l 

= ^ min [^^l{A{nJ)eA'^^^^AA)!n{x)Dn{xf+\oin^ 

< [ min(8e-5"^(^-»W)/"W°"W'+'°g",l)/(a;)da; 



where the convergence is dominated, so we need to show that for $f(x) > 0$
the integrand tends to 0 almost everywhere w.r.t. $\lambda$, for which we can
apply Theorem 7.16 in Wheeden, Zygmund (1977). The second term in (10) is
similar. To complete the asymptotics for $\mathrm{Var}(S_n)$ it remains to show that

mn „ 

nY,'^i%{Anj)\{ij.n{Anj) - KAnj))) “> 7a = / \D{x)\f{x)dx 3S n — )> oo. 

j=l 

For a set A 

nE i\f„{A)\{fi„{A) - fi{A))) = nE(|f„(^)|A„(7l)) - nE(if„(^)|)M(7l) 

For a set A let Y {A) be a random variable with 

P(F(yl) eB) = F{Y e B\x e A) 
then it can be easily seen that 
nEi\UA)\{^n{A) - n{A))) 

= ^i{A) f; (e| £(1 - 2Yi{A))\ - E| f:(l - 2r,(yl))|) (^M^))% -n^(A) 

jfc=0 \ i=l i=l ) 

Without loss of generality assume that t{A) > 0 then 

/ k+l k \ 

H{A) E| ^(1 - 2Yi{A))\ - E| X:(l - 2Yi(A))\ 

V i=l i=l / 
( / A;+l fc+1 

E 5:(1 - 2Y,{A)) + 2E(- $:(1 - 2Y,{A))y 
V i=l 2=1 



E X:(l - 2Y,{A)) -f 2E(- J:(1 - 2Y,{A))Y 



2=1 



2=1 



/c-|-l 



= t^{A) E(1 - 2YM)) + 2E(- 2Yi{A)))+ - 2E(- - 2Y{A))y 



\ 



2=1 



2=1 



Because of |2(- - 2yi(^)))+ - 2(- E?=i(l - 2Fi(^)))+| < 2 apply 

Hoeffdings inequality as in the proof of Lemma 2 

( fc+l k \ 

E| 5](1 - 2F,(^))| - E| ^(1 - 2F(A))| - /x(A)E(l - 2Y,{A))\ 

2=1 2=1 / 

( k+1 k \ 

E| - 2Yi{A))\ - E| $:(1 - 2F(^))| - |r(^)l| 

2=1 2=1 / 

k-\-l k 

= 12E(- 5^(1 - 2F,(A)))-2E(- ^{1 - 2F(A)))+| 

2=1 2=1 

/ fc +1 fc 

< M^) P{E(i - 2 F(+) < 0 } + p{E(i - 2 F(+) < 0 } 

\ i=l i=l 

< 2n{A) (eAM)r{A)y(2n{Af) ^-kr{A)y{2^{Ay)^ 

Summarizing the consequences of these inequalities we get that

rrin mn 

- n{An,))) - E \r{An 



^njJ 



J=1 

< 



j=l 



E ^KA„j) E C-'‘AAniyH2MA„i)^) i'^tJ'iAnj))'‘ ^-n^(A„,) 

j=l *=0 



< '^A^i{Anj)e '^AAni? KMAnj)) 
J = 1 



therefore 

run 



n 



yrt ui,n P 

EE(|TVi(^ni)|(Mn(^nj) “ /^(^nj))) = o(l)+E / 1-0(2;) |/(a:)clx = 7„. 

;=l j=i 

To finish the proof of

\[ S_n \xrightarrow{D} N\left(0,\ t^2 \rho_\alpha^2 + 2tv \gamma_\alpha + v^2\right) \quad \text{as } n \to \infty \]

by Lyapunov's central limit theorem, it suffices to prove that

( TTln 

i\UAnj)\ - E|f„(A„,)|) + vi^Anj) - ti{Anj)) |®) 

oo 

j=mn-\-l 

or, by invoking the inequality, that 

rUn 

and 

c» 

~ ^ 0. 

The second limit is shown in Beirlant, Gyorfi and Lugosi (1994). Concerning 
the first one apply Lemma 3 and Lemma 1 for p = 3. □ 

The proof of Theorem 2 is similar and is left to the reader.




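Lemma 1 (Beirlant, Mason (1995)) Let $\{Z_i\}$ be i.i.d. random variables with
$E|Z_1|^3 < \infty$, let $N_n$ be a Poisson($n$) random variable independent
of $\{Z_i\}$, and $N$ a standard normal random variable. Then there is a
constant $A_p$ (depending only on $p$) such that for $1 \le p \le 3$

\[ \left| E\left| \frac{\sum_{i=1}^{N_n} Z_i - n E Z_1}{\sqrt{n E Z_1^2}} \right|^p - E|N|^p \right| \le \frac{A_p}{\sqrt{n}} \cdot \frac{E|Z_1|^3}{(E Z_1^2)^{3/2}}. \]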

Lemma 2 For a set A 



0 < E|nf„(A)| — \nr{A)\ < 2np{A)e 



nr(A)^ 



Lemma 3 If $Z$ is a random variable with $E|Z|^3 < \infty$ then

\[ E\big| |Z| - E|Z| \big|^3 \le 27\, E|Z - EZ|^3. \]
Abstract: A very important part of the process of transmitting information is
the generation of words. For some time we have tried to consider the process
of generation of words by generative systems from a stochastic point of view
involving Markov chains. Because the sequences of intermediate words (called
derivations) by which a word can be generated are finite, it follows that
finite Markov chains will be connected to the process. In order for our
discussion to be as general as possible we consider generative systems free
of any restrictions. The model of such systems is offered by the most general
class of formal grammars from the so-called Chomsky hierarchy, namely the
phrase-structure grammars. The novelty that we propose consists in a new
organization of the process of generation of words by considering the set of
all the derivations split into equivalence classes, each of them containing
derivations of the same length. In this way our results are available up to
an equivalence.

Keywords: generative systems, Markov chains, transition matrix. 



1. The probability of generation of the words 

We consider the process in which a word is obtained by a sequence of
intermediate words starting from a fixed initial symbol $\sigma$:
$\sigma = w_0, w_1, \ldots, w_j$. We refer to such a sequence as a derivation
of length $j$. In this way, each word stores the information obtained
previously, such that the process has a Markovian behavior. At the end $w_j$
is just a word written as a sequence of symbols belonging to a known
alphabet, called the terminal alphabet.

We impose no restrictions on the passing from an intermediate word $w_i$ to
the next $w_{i+1}$, so that the generative system is free of any
restrictions. Because the case of generation of a word directly from
$\sigma$ is of no particular interest, we suppose all the time that the
length of a derivation is $\ge 2$.

From now on we shall consider that the set of all the derivations is split
into equivalence classes, each of them containing derivations of the same
length. Thus, the results will be available up to an equivalence.
Let $D_x$, $x \ge 2$, be the equivalence class of the derivations of length
$x$ and let us denote by $\nu_x$ the number of derivations in the class $D_x$
by which a word $w$ is generated. Obviously, it can or cannot be generated
into the class $D_x$. Thus, if it is, the probability that it will be
generated again into $D_{x+1}$ is denoted by $\gamma$; but given that $w$ is
not generated into $D_x$, the probability that it should be generated into
$D_{x+1}$ is denoted by $\beta$. Hence we are in the case when the
equivalence classes of the derivations are connected into a simple Markov
chain. Furthermore, $w$ can or cannot be generated directly from $\sigma$,
and the probabilities that it will be or will not be in these cases are
unknown to us. If we agree to denote by $D_1$ the class of derivations of
length 1, it means that there is no equivalence class preceding it. Then, let
$p_1$ be the probability that $w$ should be generated into $D_1$ and
$q_1 = 1 - p_1$ the contrary probability. Now let $p_x$ be the probability
that $w$ should be generated into $D_x$ and $q_x = 1 - p_x$ the contrary
probability. The following result is obtained.

Theorem 1. A word $w$ is generated into the class $D_x$, $x \ge 2$, if

\[ p_x \to \frac{\beta}{1 - \delta}, \quad \text{where } \delta = \gamma - \beta. \qquad (1) \]

But the constant to which $p_x$ tends does not depend on the probability
$p_1$. Because it plays the role of a limiting probability we introduce the
notation

\[ p = \frac{\beta}{1 - \delta} \quad \text{and} \quad q = 1 - p = \frac{1 - \gamma}{1 - \delta} \]

such that we have

\[ p_x = (p_1 - p)\, \delta^{x-1} + p. \qquad (2) \]

Now, let us propose to determine the probability that $w$ should be generated
into the class $D_j$ if it was already generated into the class $D_i$,
$i < j$. We denote by $p_j^{(i)}$ the probability that $w$ should be generated
into the class $D_j$ if generated into the class $D_i$, $i < j$. In this case
the following theorem is proved.

Theorem 2. The probability that a word should be generated into the class
$D_j$ if it was generated into the class $D_i$, $i < j$, $i \ge 2$,
$j \ge 2$, is given by the equality

\[ p_j^{(i)} = p + q\, \delta^{j-i}. \qquad (3) \]
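The closed forms (2) and (3) can be checked against the one-step recursion of the underlying two-state chain, $p_{x+1} = p_x \gamma + (1 - p_x) \beta$. The sketch below is only an illustration: the numerical values of `gamma`, `beta` and `p1` are hypothetical, and the state labelling (state 0 for "generated", state 1 for "not generated") is our own convention.

```python
import numpy as np

gamma, beta = 0.7, 0.2       # hypothetical transition probabilities
delta = gamma - beta
p = beta / (1.0 - delta)     # limiting probability
q = 1.0 - p
p1 = 0.9                     # hypothetical initial probability

def p_recursive(x):
    """p_x obtained by iterating p_{x+1} = p_x*gamma + (1 - p_x)*beta."""
    px = p1
    for _ in range(x - 1):
        px = px * gamma + (1.0 - px) * beta
    return px

def p_closed(x):
    """Closed form (2): p_x = (p_1 - p) * delta**(x - 1) + p."""
    return (p1 - p) * delta ** (x - 1) + p

# One-step matrix: state 0 = "generated", state 1 = "not generated".
P = np.array([[gamma, 1.0 - gamma], [beta, 1.0 - beta]])

def p_conditional(i, j):
    """Probability of being generated into D_j given generation into D_i."""
    return np.linalg.matrix_power(P, j - i)[0, 0]
```

For any $i < j$, `p_conditional(i, j)` agrees with $p + q\,\delta^{j-i}$ up to rounding, and `p_recursive(x)` agrees with `p_closed(x)`.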
2. The main characteristics of $\nu_x$

It is obvious that $\nu_x$ defined before is a random variable that takes on
values 0 and 1 with probabilities $q_x = 1 - p_x$ and $p_x$ respectively.
Then, the number of derivations in $n - 1$ equivalence classes, by which the
word $w$ is generated, is

V-t V. 

such that its expectation is 

Et/=X Evx while the variance is 01^=2] 

;r=2 x^2 

The main result is the following 

Theorem 4. The expectation and the variance of the random variable giving
the number of derivations by which a word is generated are as follows

\[ E\nu = (n - 1)p + a_n \quad \text{and} \quad \sigma_\nu^2 = pq \left[ n\, \frac{1 + \delta}{1 - \delta} + b_n \right], \]

where $a_n$ and $b_n$ are some quantities that remain bounded as $n$ increases.



3. The alternating generation procedure 

Now we propose to study the case when a word $w$ is generated by $m$
equivalence classes out of $n$ under the following conditions:

1. It will be generated by the first class and the last and there is a direct
rule $(\sigma, w)$;

2. It will be generated by the first class but it will not be generated by
the last and there is a direct rule $(\sigma, w)$;

3. It will not be generated by the first class but it will be generated by
the last and there is not a direct rule $(\sigma, w)$;

4. It will be generated neither by the first class nor by the last and there
is not a direct rule $(\sigma, w)$.

We call such a way of generating words an alternating generation procedure.
If $P_n(m)$ is the notation for the probability that we want to determine,
then we have

\[ P_n(m) = P_n(m, w w) + P_n(m, w \bar{w}) + P_n(m, \bar{w} w) + P_n(m, \bar{w} \bar{w}), \]

where $P_n(m, w w)$ is the notation for case 1, and so on.
For example we obtain in the first case:



P„{m,ww)=piZ^ r 
k=2\k - 1 / 






n-m~\ 
. k-2 



\-p) 



n-m~k-^l n ^'-1 






Now by computing these four probabilities we find the following main result 
which is of the form of a limit theorem 

Theorem 5. If the word w is generated by an alternating generation procedure, 
the derivations belonging to n equivalence classes then, the probability that w 
should be generated by m classes out of n is given by the relation 



Pn{m) = 



^jlmpq - P 






where dn vanish uniformly when z lies in some finite interval and z verifies the 
equality 



m= np + z 



my{\ + m)p (1 - yg ) 



4. The fork-join generation procedure 

Now we shall define a new procedure for generating words. We take into
consideration only the case when a word cannot be generated by an equivalence
class of derivations. Thus, if a word is not generated by a class $D_x$,
$x > 2$, then it will be generated by the class $D_{x-1}$ with probability
$q$ and by the class $D_{x+1}$ with probability $p = 1 - q$. Relating to the
first class and the last we suppose that a word can or cannot be generated by
these two classes. But for the cases when it is not generated we put the
following supplemental conditions:

1. If the word is not generated by the first class $D_2$ then it will
certainly be generated by the next;

2. If the word is not generated by the last class then it will certainly be
generated by the last but one.

It is referred to as a fork-join generation procedure. For all the other
classes $D_x$, $2 < x < n$, we suppose that the word, being in each of them,
is subject to a fork-join generation procedure. The following cases can be
distinguished:

1. The word will be generated by the first class and the last class;

2. It will be generated by the first class but it will not be generated by
the last;

3. It will not be generated by the first class but it will be generated by
the last;

4. It will be generated neither by the first class nor by the last class.
For each of them we have determined the two-step transition matrix and we
have come to the following results:

• The rows of rank $i = 3, 4, \ldots, n - 3$ contain, each of them, the
triplet of elements $q^2$, $2pq$, $p^2$, disposed with $q^2$ and $p^2$ on two
diagonals to the left and respectively to the right of the main diagonal,
which contains the element $2pq$;

• The first two and the last two rows differ from one case to another. For
these rows we have:

(i) the first case:

$p_{11} = p_{n-1,n-1} = 1$, $p_{21} = q$, $p_{22} = p_{n-2,n-2} = pq$,
$p_{24} = p^2$, $p_{n-2,n-4} = q^2$ and $p_{n-2,n-1} = p$;

(ii) the second case:

$p_{11} = 1$, $p_{21} = p_{n-1,n-3} = q$, $p_{n-1,n-1} = p$, $p_{22} = pq$,
$p_{24} = p^2$, $p_{n-2,n-4} = q^2$ and $p_{n-2,n-2} = p + qp$;

(iii) the third case:

$p_{11} = q$, $p_{13} = p_{n-2,n-1} = p$, $p_{n-1,n-1} = 1$,
$p_{22} = q + pq$, $p_{24} = p^2$, $p_{n-2,n-4} = q^2$ and
$p_{n-2,n-2} = qp$;

(iv) the fourth case:

$p_{11} = p_{n-1,n-3} = q$, $p_{13} = p_{n-1,n-1} = p$, $p_{22} = q + pq$,
$p_{24} = p^2$, $p_{n-2,n-4} = q^2$ and $p_{n-2,n-2} = p + qp$.



The main result is a specific symmetry property of these four cases that can
be stated as follows.

Theorem 6. If a word is in a random process of generation by a fork-join
generation procedure then, in all the cases of generation, the two-step
transition matrix has $n - 5$ successive rows, each of them containing the
triplet of elements $q^2$, $2pq$, $p^2$, symmetrically disposed with respect
to the first two and the last two rows. Furthermore, $q^2$ and $p^2$ are
elements of two distinct diagonals symmetrically disposed about the main
diagonal, which contains the element $2pq$.

As regards the first and the fourth cases, they have led us to a surprising
result. Let us denote by $A_1$ the event consisting in the word being
generated by the class $D_2$, by $A_2$ it being generated by the class $D_3$,
..., by $A_{n-1}$ it being generated by the class $D_n$.

Now in the first case, the two-step transition matrix is the same as the
two-step transition matrix for a particle in a random walk between two
absorbing barriers, known in the theory of Markov chains. For example let us
consider that a particle located on a straight line moves along the line via
random impacts occurring at times $t_1, t_2, t_3, \ldots$. The particle can be
at points with integral coordinates $a, a+1, a+2, \ldots, b$. At points $a$
and $b$ there are absorbing barriers. Each impact displaces the particle to
the right with probability $p$ and to the left with probability $q = 1 - p$
so long as the particle is not located at a barrier. If the particle is at a
barrier then it remains in the states $A_1$ and $A_{n-1}$ with probability 1.

As regards the fourth case of generation of a word by a fork-join generation
procedure, the two-step transition matrix is the same as the two-step
transition matrix for a particle in a random walk between two reflecting
barriers, also known in the theory of Markov chains. The conditions remain
the same as in the former case, the only difference being that if the
particle is at a barrier then any impact will transfer it one unit inside the
gap between the barriers.

These cases of generation of the word by a fork-join generation procedure
are of special interest. We also emphasize their practical character and we
believe that they will be very useful in some studies on generative systems.
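The correspondence with random walks between barriers can be verified numerically by squaring the one-step transition matrix. The sketch below is a hedged illustration with hypothetical function names; states are indexed from 0 in the code, so row 0 corresponds to $A_1$ and the last row to $A_{n-1}$.

```python
import numpy as np

def one_step_matrix(n_states, p, left, right):
    """One-step transition matrix of a walk on n_states points.

    left and right are "absorbing" or "reflecting" and describe the barriers.
    """
    q = 1.0 - p
    P = np.zeros((n_states, n_states))
    for i in range(1, n_states - 1):
        P[i, i - 1] = q   # impact to the left
        P[i, i + 1] = p   # impact to the right
    if left == "absorbing":
        P[0, 0] = 1.0     # particle stays at the barrier
    else:
        P[0, 1] = 1.0     # particle is pushed one unit inside
    if right == "absorbing":
        P[-1, -1] = 1.0
    else:
        P[-1, -2] = 1.0
    return P

def two_step_matrix(n_states, p, left, right):
    """Two-step transition matrix, the square of the one-step matrix."""
    P = one_step_matrix(n_states, p, left, right)
    return P @ P
```

Calling `two_step_matrix` with both barriers absorbing reproduces the boundary rows of the first case, and with both barriers reflecting those of the fourth case; mixing the two options gives the second and third cases.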
Abstract: A model for simultaneous optimization of combinations of test-based
decisions in education and psychology is proposed using Bayesian decision
theory. To illustrate the approach, one classification decision with two
treatments, each followed by a mastery decision, is combined into a decision
network. An important distinction is made between weak and strong decision
rules. As opposed to strong rules, weak rules are allowed to be a function of
prior test scores in the series. Conditions under which optimal rules take
weak monotone forms are derived. Results from a well-known problem in The
Netherlands of selecting optimal continuation schools on the basis of
achievement test scores are presented.

Keywords: Bayesian Decision Theory, Classification Decisions, Mastery 
Testing, Simultaneous Optimization, Compensatory Rules. 



1. Introduction 

Statistical decision problems arise when a decision-maker is faced with the need 
to choose an action that is optimal in some sense. In practice, one decision 
problem may often lead to another, which, in turn, may lead to a third one, and 
so on. A well-known example of combinations of test-based decisions is an 
Individualized Study System (ISS) in education. Such systems can be conceived 
of as instructional networks consisting of various types of decisions as nodes, 
with individual routes for the students (e.g., van der Linden & Vos, 1996). 

An appropriate framework for dealing with instructional decision 
making in ISSs is Bayesian decision theory (e.g., Lindgren, 1976, Chapter 8). 
The application of Bayesian methods to, for instance, instructional decision 
making in ISSs, consists of two basic elements: A probability model relating 
test and criterion scores to each other, and a utility structure evaluating the total 
costs and benefits of all possible decision outcomes. To optimize series of 
decision problems within a Bayesian decision-theoretic framework, it is 
common to maximize the expected utility for each decision separately. 

The purpose of this paper is to show that multiple decisions in networks
can also be optimized simultaneously by maximizing the expected utility over
all possible combinations of decision outcomes. Since more information usually
results in better decision procedures, it is expected that a simultaneous approach
has two main advantages compared with optimizing each decision separately.
First, it is expected that efficiency of decisions can be increased using data from
earlier tests as collateral information in later decisions. Second, more realistic
utility structures can be obtained defining utility functions for earlier decisions
on the ultimate success criterion in the network.

To illustrate the simultaneous approach, one classification decision with 
two treatments each followed by a mastery decision are combined into a 
decision network. Examples of mastery decisions include pass-fail decisions in 
education, certification, and successfulness of therapies. Well-known test-based 
classification problems are educational or vocational guidance situations where 
most promising types of schools or careers must be identified, classification of 
students into ISSs with tracks at different levels, selecting one out of different 
available therapies in clinical settings, and testing for the military service. 



2. The Classification-Mastery Problem 

In classification decisions, the decision problem consists of a choice among 
several alternative treatments to which subjects have to be assigned on the basis 
of their test scores. Prior to the treatments, all subjects are administered the 
same classification test and the success of each treatment is measured by its 
own criterion. Therefore, completion of each treatment is followed by different 
mastery tests, which the subjects may pass or fail. 

In the following, we shall suppose that the test scores observed prior to
the treatment for an arbitrary individual are denoted by a random variable $X$.
Each treatment $j$ is followed by a mastery test, with scores denoted by a
random variable $Y_j$ ($j = 0, 1$). It is assumed that the mastery tests $Y_j$
are unreliable representations of the criteria measuring the success of the
treatments. The criteria to be considered are the classical test theory true
scores (e.g., Lord & Novick, 1968) underlying the mastery tests $Y_j$, and will
be denoted by $T_j$. The variables $X$, $Y_j$, and $T_j$ will be considered to
be continuous, with realizations $x$, $y_j$, and $t_j$. Furthermore, it is
assumed that the classification of subjects into $j$ treatments yields a joint
distribution of scores $X$, $Y_j$, and $T_j$ with density function
$f_j(x, y_j, t_j)$.

The mastery score variables $Y_0$ and $Y_1$ will be rescaled such that they
both take values on the same domain. As a result, for the realizations $y_0$
and $y_1$ of the random variables $Y_0$ and $Y_1$ the indices 0 and 1 can be
dropped in the remainder of this paper. Of course, this does not mean that a
subject receives the same value for $y$ if (s)he follows different treatments.
Similarly, since $T_j$ is defined as the expectation of $Y_j$ according to
classical test theory, the indices 0 and 1 will be dropped for the
realizations $t_0$ and $t_1$ of $T_0$ and $T_1$.
3. Weak Monotone and Strong Monotone Rules

It is assumed that we can restrict the class of all possible simultaneous
rules to monotone rules, that is, rules taking the form of cutting scores. Let
$x_c$, $y_{cj}$, and $t_{cj}$ denote the cutting scores on the random variables
$X$, $Y_j$, and $T_j$, respectively, where $t_{cj}$ is set in advance by the
decision maker ($j = 0, 1$). Clearly, the classification-mastery problem now
consists of simultaneously setting cutting scores $x_c$ and $y_{cj}$ such that,
given the values of $t_{cj}$, the expected utility over all possible
combinations of decision outcomes is maximized. Note that the first advantage
of the simultaneous approach (i.e., making more efficient use of the data) is
nicely demonstrated by the fact that in optimizing mastery decisions, prior
data on the classification test is used as collateral information.

In general, the observed scores on the classification test may or may not
be taken explicitly into account in setting cutting scores on the mastery test
score variables $Y_j$. As a case for the former, it seems reasonable that
students who have been assigned to Treatment 1 with observed classification
scores far above $x_c$ are sooner granted mastery status than students with
classification scores equal to or just above $x_c$. To distinguish between
cases where cutting scores on the mastery tests do or do not depend on $x$,
those rules will be denoted by weak and strong monotone rules, respectively.
For each $x$, weak cutting scores on the mastery tests $Y_0$ and $Y_1$ have to
be computed as functions $y_{c0}(x)$ and $y_{c1}(x)$, respectively.

Let $a_{jh}$ stand for the action either to retain ($h = 0$) or advance
($h = 1$) a subject who is classified into treatment $j$; then a weak
simultaneous rule $\delta$ (i.e., a mapping from the sample space $X \times Y_j$
onto the action space) can be defined for our classification-mastery problem
as follows:

\[
\begin{aligned}
\{(x, y) : \delta(x, y) = a_{00}\} &= A \times B_0(x) \\
\{(x, y) : \delta(x, y) = a_{01}\} &= A \times B_0^c(x) \\
\{(x, y) : \delta(x, y) = a_{10}\} &= A^c \times B_1(x) \\
\{(x, y) : \delta(x, y) = a_{11}\} &= A^c \times B_1^c(x),
\end{aligned}
\qquad (1)
\]

where $A$, $A^c$, $B_j(x)$, and $B_j^c(x)$ stand, respectively, for the sets of
$x$ and $y$ values for which a subject is classified into Treatment 0, into
Treatment 1, retained in treatment $j$, and advanced in treatment $j$. Note
that at the left-hand side as well as at the right-hand side of (1) we have
subsets in the $X \times Y_j$-space, since both subsets $A$ and $A^c$ are sets
of $x$ values while both subsets $B_j(x)$ and $B_j^c(x)$ represent sets of $y$
values depending on the variable $x$. It follows from (1) that a weak monotone
rule $\delta(X, Y_j)$ can be defined for our example as:
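$$
\delta(X, Y_j) =
\begin{cases}
a_{00} & \text{for } X < x_c,\ Y_0 < y_{c0}(x) \\
a_{01} & \text{for } X < x_c,\ Y_0 \ge y_{c0}(x) \\
a_{10} & \text{for } X \ge x_c,\ Y_1 < y_{c1}(x) \\
a_{11} & \text{for } X \ge x_c,\ Y_1 \ge y_{c1}(x).
\end{cases}
\tag{2}
$$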



Conditions under which we can confine ourselves to the subclass of 
monotone simultaneous rules in looking for optimal rules will be considered 
below. 



4. An Additive Utility Structure for the Combined Problem 

A utility function ujh(t) evaluates the consequences of taking action ajh while 
the true score of the subject is t. In the present paper, it is assumed that the 
utility structure of the classification-mastery problem can be represented as an 
additive function of the following form: 

$$
\begin{aligned}
u_{0h}(t) &= w_1 u_0^{(c)}(t) + w_2 u_{0h}^{(m)}(t) \\
u_{1h}(t) &= w_1 u_1^{(c)}(t) + w_3 u_{1h}^{(m)}(t),
\end{aligned}
\tag{3}
$$

where $u_j^{(c)}(t)$ and $u_{jh}^{(m)}(t)$ represent the utility functions for the separate 
classification and mastery decisions under treatment j, and w1, w2, and w3 
represent nonnegative weights. A common weight w1 has been used in both 
functions because utility scales have an arbitrary zero (Luce & Raiffa, 1957). 

Since utility functions are also measured with an arbitrary unit, and 
assuming w2 = w3 (i.e., the utility functions for both 
mastery decisions are equally weighted), the weights in (3) can always be 
rescaled, leading to the following result: 

$$
u_{jh}(t) = w\,u_j^{(c)}(t) + [(1-w)/2]\,u_{jh}^{(m)}(t), \tag{4}
$$

where 0 < w < 1. 

It should be emphasized that both terms in the right-hand side of (4) are 
a function of the same variable. As indicated by van der Linden and Vos (1996), 
therefore, the utility structure assumed is not an instance of a utility function in 
a multiple-criteria decision problem. The usual assumptions needed to motivate 
an additive structure for multiple-criteria functions in the Luce-Raiffa sense to 
provide a utility function are thus not needed here. 

Note that the second advantage of the simultaneous approach (i.e., more 
realistic utility structures) is effectively demonstrated by the additive utility 
structure of (4), in which utility functions defined on the ultimate criteria 
variables T0 and T1 are also used in previous decision problems, namely, the 
problem of classifying subjects into Treatment 0 and Treatment 1. This choice 
is in line with the philosophy underlying most ISSs. 



5. Sufficient Conditions for Monotone Simultaneous Rules 



For every integrable function f(x) and any set S of x values, it holds that 

$$
\int_S f(x)\,dx \le \int_{S_0} f(x)\,dx \quad \text{with } S_0 = \{x : f(x) \ge 0\}. \tag{5}
$$

The theorem in (5) is implied by the obvious inequality f(x) ≤ |f(x)|. Applying 
this theorem to the expected utility in the simultaneous approach, the sets A^c 
and Bj^c(x) can be derived for which an upper bound is obtained for this 
quantity. Next, it can be examined under which conditions the sets A^c and 
Bj^c(x) take a weak monotone form. 
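
The inequality in (5) can be checked numerically. The sketch below is only an illustration: the integrand f, the interval [0, 4], the grid size, and the randomly drawn sets S (intervals) are all assumptions made for the example and do not come from the paper. It approximates the integrals by midpoint sums and confirms that no trial set S gives a larger value than the set S0 on which f is nonnegative.

```python
import math
import random

def f(x):
    # Hypothetical integrand that changes sign on [0, 4].
    return math.sin(3 * x) - 0.2

N = 2000
LOWER, UPPER = 0.0, 4.0
dx = (UPPER - LOWER) / N
grid = [LOWER + (i + 0.5) * dx for i in range(N)]

def integral_over(indicator):
    """Midpoint-rule approximation of the integral of f over {x : indicator(x)}."""
    return sum(f(x) * dx for x in grid if indicator(x))

# Integral over S0 = {x : f(x) >= 0}, the upper bound in (5).
bound = integral_over(lambda x: f(x) >= 0)

def random_interval():
    """Draw a random interval [lo, hi] inside [LOWER, UPPER] as a trial set S."""
    lo = random.uniform(LOWER, UPPER)
    hi = random.uniform(lo, UPPER)
    return lambda x: lo <= x <= hi

random.seed(0)
trial_values = [integral_over(random_interval()) for _ in range(200)]
print(max(trial_values) <= bound + 1e-9)
```

The same argument is what allows the sets in the expected utility to be chosen pointwise: including exactly the points where the integrand is nonnegative can never do worse than any other choice.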

Let E[Usim(A^c, B0^c(x), B1^c(x))] denote the expected utility in the 
simultaneous approach, then this quantity is calculated over all possible 
combinations of decision outcomes by adding the expected utilities associated 
with the four possible actions a00, a01, a10, and a11. It follows that 
E[Usim(A^c, B0^c(x), B1^c(x))] can be written as: 

$$
\begin{aligned}
E[U_{sim}(A^c, B_0^c(x), B_1^c(x))] ={}& \int_A \int_{B_0(x)} \int_R u_{00}(t) f_0(x,y,t)\,dt\,dy\,dx + \int_A \int_{B_0^c(x)} \int_R u_{01}(t) f_0(x,y,t)\,dt\,dy\,dx \\
&+ \int_{A^c} \int_{B_1(x)} \int_R u_{10}(t) f_1(x,y,t)\,dt\,dy\,dx + \int_{A^c} \int_{B_1^c(x)} \int_R u_{11}(t) f_1(x,y,t)\,dt\,dy\,dx,
\end{aligned}
\tag{6}
$$

where R denotes the set of real numbers. Taking expectations, completing 
integrals w.r.t. y, and rearranging terms, (6) can be written in its posterior form: 

$$
\begin{aligned}
E[U_{sim}(A^c, B_0^c(x), B_1^c(x))] ={}& \int_A E_0[u_{00}(T_0) \mid x]\,q(x)\,dx + \int_{A^c} E_1[u_{10}(T_1) \mid x]\,q(x)\,dx \\
&+ \int_A \int_{B_0^c(x)} E_0[u_{01}(T_0) - u_{00}(T_0) \mid x,y]\,h_0(x,y)\,dy\,dx \\
&+ \int_{A^c} \int_{B_1^c(x)} E_1[u_{11}(T_1) - u_{10}(T_1) \mid x,y]\,h_1(x,y)\,dy\,dx,
\end{aligned}
\tag{7}
$$
with hj(x,y) denoting the p.d.f. of (X,Yj), and where Ej[.] indicates that the 
expectation has been taken over a distribution indexed by j. Applying the 
theorem in (5) first to the inner integrals w.r.t. y, and using hj(x,y) ≥ 0, it can 
easily be verified that for each value x of X an upper bound to (7) is obtained 
for: 



$$
B_0^{*c}(x) = \{y : E_0[u_{01}(T_0) - u_{00}(T_0) \mid x,y] \ge 0\}, \tag{8}
$$

$$
B_1^{*c}(x) = \{y : E_1[u_{11}(T_1) - u_{10}(T_1) \mid x,y] \ge 0\}. \tag{9}
$$

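Next, inserting the sets B*0^c(x) and B*1^c(x) into (7), completing integrals w.r.t. 
x, and rearranging terms, (7) becomes 

$$
\begin{aligned}
E[U_{sim}(A^c, B_0^c(x), B_1^c(x))] \le{}& E_0[u_{00}(T_0)] + \int_R \Big\{ \int_{B_0^{*c}(x)} E_0[u_{01}(T_0) - u_{00}(T_0) \mid x,y]\,k_0(y \mid x)\,dy \Big\}\,q(x)\,dx \\
&+ \int_{A^c} \Big\{ E_1[u_{10}(T_1) \mid x] - E_0[u_{00}(T_0) \mid x] \\
&\qquad + \sum_{j=0}^{1} (2j-1) \int_{B_j^{*c}(x)} E_j[u_{j1}(T_j) - u_{j0}(T_j) \mid x,y]\,k_j(y \mid x)\,dy \Big\}\,q(x)\,dx,
\end{aligned}
\tag{10}
$$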



where kj(y | x) and q(x) denote the p.d.f.'s of Yj given X = x and X, respectively. 
Finally, applying the theorem in (5) to the outer integral w.r.t. x, and using 
q(x) ≥ 0, it can easily be verified that an upper bound to (10) is obtained for: 

$$
A^{*c} = \Big\{ x : E_1[u_{10}(T_1) \mid x] - E_0[u_{00}(T_0) \mid x] + \sum_{j=0}^{1} (2j-1) \int_{B_j^{*c}(x)} E_j[u_{j1}(T_j) - u_{j0}(T_j) \mid x,y]\,k_j(y \mid x)\,dy \ge 0 \Big\}. \tag{11}
$$

Conditions for weak monotonicity can now be specified. For weak 
monotone rules, the sets B*j^c(x) and A*^c take the form [ycj(x), ∞) and [xc, ∞), 
respectively. Hence, optimal rules take a weak monotone form if the left-hand 
sides of the inequalities in (8)-(9) and (11) are increasing functions in y for all x 
and in x, respectively. In the empirical example below it will be examined if 
these conditions hold. 
6. Calculation of Optimal Weak Cutting Scores 

Assuming the conditions for weak monotonicity are satisfied, optimal weak 
cutting scores can now be obtained for those values of ycj(x) and xc for which 
the inequalities in (8)-(9) and (11) turn into equalities. Since (8)-(9) hold for all 
x, and thus for xc, first the optimal weak cutting score on the classification test 
can be found by equating the left-hand sides of (8)-(9) and (11) to zero and 
solving the system of equations simultaneously for xc, yc0(xc), and yc1(xc). Then 
for each x < xc and x > xc, optimal weak cutting scores on the mastery tests Y0 
and Y1 can be obtained by putting the left-hand sides of (8) and (9) equal to zero 
and solving for yc0(x) and yc1(x), respectively. 

Since no analytical solutions for the system of equations could be found, 
Newton’s iterative algorithm for solving nonlinear equations was used. 
Assuming a trivariate normal distribution for fj(x,y,t), a computer program 
called NEWTON was written to calculate the optimal weak cutting scores. 
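
The NEWTON program itself is not reproduced in the paper. As a minimal sketch of the kind of iteration involved, the following Python code applies Newton's method to a single scalar equation of the type obtained when the left-hand side of (8) or (9) is set to zero. It assumes a threshold utility, so that the equation reduces to "posterior probability of true mastery equals a constant", and a normal posterior for the true score given the mastery score. The regression constants A, B, S, the cutting score T_C, and the required probability P_STAR are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

def newton(g, dg, y0, tol=1e-10, max_iter=50):
    """Newton's iteration for a scalar equation g(y) = 0 with derivative dg."""
    y = y0
    for _ in range(max_iter):
        step = g(y) / dg(y)
        y -= step
        if abs(step) < tol:
            return y
    raise RuntimeError("Newton iteration did not converge")

# Assumed posterior of the true score: T | y ~ N(A + B*y, S^2) (hypothetical values).
A, B, S = 10.0, 0.8, 4.0
T_C = 55.0      # cutting score on the true score, set in advance (assumed)
P_STAR = 0.6    # posterior probability of mastery at which both actions are equally good (assumed)

std = NormalDist()

def g(y):
    # P(T >= T_C | y) - P_STAR; its root is the cutting score on the mastery test.
    z = (T_C - (A + B * y)) / S
    return (1.0 - std.cdf(z)) - P_STAR

def dg(y):
    # Derivative of g with respect to y.
    z = (T_C - (A + B * y)) / S
    return std.pdf(z) * B / S

# Start at the point where the posterior mean equals T_C.
y_c = newton(g, dg, y0=(T_C - A) / B)
print(round(y_c, 4))
```

In this one-equation case the root also has a closed form, which makes the iteration easy to check; the system solved in the paper couples xc, yc0(xc), and yc1(xc) and has no such shortcut.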



7. Empirical Example and Discussion 

The empirical example concerns the assignment of students to appropriate 
continuation schools at the end of the elementary school (i.e., at grade 8), a 
problem well-known in the Dutch educational system. The utility functions 
$u_j^{(c)}(t)$ and $u_{jh}^{(m)}(t)$ involved for the separate classification and mastery decisions, 
respectively, were assumed to be the well-known threshold function (e.g., 
Huynh, 1976). The choice of this function implies that the costs and benefits of 
each possible decision outcome can be summarized by a constant. 

It appeared that the left-hand sides of the inequalities in (8)-(9) were 
increasing functions in y for all values of x. Hence, the sets B*j^c(x) took the 
required form [ycj(x), ∞) for all x. Inserting this result into the left-hand side of 
the inequality in (11), it appeared that this expression was increasing in x. Thus, 
the set A*^c also took the required form [xc, ∞), implying that the monotonicity 
conditions were satisfied. 

It turned out that, under the natural condition of the distributions of 
classification and mastery scores being stochastically increasing in the true 
mastery score (e.g., a trivariate normal distribution for fj(x,y,t)), optimal weak 
cutting scores on the mastery tests were decreasing functions of the 
classification scores. Hence, optimal weak monotone rules were necessarily 
compensatory by nature. For instance, students with observed 
scores on the classification test X equal to or just above the cutting score xc 
must compensate for their relatively low classification scores with higher scores on 
the mastery test Y1. 
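
The compensatory behaviour can be illustrated with a small sketch. It assumes a standardized trivariate normal distribution for (X, Y, T) with hypothetical positive correlations, and a threshold utility, so that a student is advanced when the posterior probability of true mastery reaches an assumed level P_STAR. None of the numerical values come from the empirical example; they only show that, under these assumptions, the weak cutting score on the mastery test falls as the classification score rises.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical correlations of a standardized trivariate normal (X, Y, T).
R_XY, R_XT, R_YT = 0.6, 0.5, 0.7
T_C = 0.0       # cutting score on the true score T (assumed)
P_STAR = 0.6    # posterior probability of mastery required to advance (assumed)

# Regression of T on (X, Y): E[T | x, y] = BETA_X * x + BETA_Y * y.
det = 1.0 - R_XY ** 2
BETA_X = (R_XT - R_XY * R_YT) / det
BETA_Y = (R_YT - R_XY * R_XT) / det
RESID_SD = sqrt(1.0 - (BETA_X * R_XT + BETA_Y * R_YT))

std = NormalDist()

def prob_mastery(x, y):
    """P(T >= T_C | X = x, Y = y) under the assumed normal model."""
    mean = BETA_X * x + BETA_Y * y
    return 1.0 - std.cdf((T_C - mean) / RESID_SD)

def weak_cutting_score(x):
    """Mastery score y at which P(T >= T_C | x, y) equals P_STAR."""
    z = std.inv_cdf(1.0 - P_STAR)
    return (T_C - RESID_SD * z - BETA_X * x) / BETA_Y

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
cuts = [weak_cutting_score(x) for x in xs]
for x, y in zip(xs, cuts):
    print(x, round(y, 3))
```

Because both regression weights are positive here, a higher classification score x lowers the mastery score needed to reach the same posterior probability.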

As a result of making more efficient use of the test data in a 
simultaneous approach, one might expect a gain in expected utility compared 
with optimizing each decision separately. To see whether this expectation could 
be confirmed, the expected utilities for the optimal weak monotone rules were 
compared with the weighted sum of the expected utilities for the optimal 
separate rules. The optimal separate cutting scores could easily be derived by 
imposing certain restrictions on the expected utility for the simultaneous 
approach. The results indicated that the expected utilities for the simultaneous 
approach were substantially larger than for the separate approach. 

Two final remarks are appropriate. First, the procedures advocated in 
this paper have a larger scope of application than assigning students to optimal 
types of comprehensive education. For instance, the classification-mastery 
problem may be important in the area of psychotherapy in which patients have 
to be classified into the most appropriate therapy followed by a test, which has 
to be passed before they can be dismissed from the therapy. Generally, in any 
situation in which subjects have to be assigned to one out of several possible 
treatments, training programs, or therapies on the basis of their scores on a test 
and in which each treatment is followed by a different mastery test, the optimal 
rules presented here have the potential to improve the quality of the decisions. 
Besides the areas of selecting optimal continuation schools and psychotherapy, 
the optimal decision procedures may be important in management sciences, 
agricultural sciences, medical decision making, and so forth. 

Second, the problem of simultaneously optimizing classification decisions 
with two treatments, each followed by a mastery decision, has been dealt with here 
by applying Bayesian decision theory. However, other general classification 
approaches may also be appropriate, such as discriminant analysis (e.g., 
Tatsuoka, 1971, Chapter 8; McLachlan, 1992) or techniques based on dissimilarity 
measures (e.g., Mirkin, 1996). 
Abstract: In discriminant analysis of images, training sets are usually supplied by 
experts in the field of interest. This paper proposes a procedure that avoids 
this reliance on experts. The scheme has two steps. First, 
using a multivariate, nonparametric approach based on support comparison, 
homogeneous areas are detected in the image. Second, a 
similarity measure built on the same criterion is used to merge these areas 
into a small number of classes. These classes can then be used 
as training sets for any discriminant analysis procedure. 

Key words: training sets, images, support comparison, discriminant analysis, 
homogeneous areas, experts. 



1. Introduction 

Using an interdisciplinary approach inside the GEOSATEL Laboratory, 
mathematicians and geographers are working together with a view to combining 
the results of long-standing statistical research with a geographical application to 
environmental planning. 

For several years, we have been interested in discriminant analysis applied to 
multispectral images at satellite scale. A supervised classification scheme has 
been developed which appears highly efficient on remote sensing data. 
Recently, we applied our procedure to airborne photographs and the 
classification results also looked very good. 

At satellite scale as well as at airborne scale, the training sets we used were given by 
geographers. The use of such experts to obtain training sets can be expensive and 
their choice of training data can be subjective. 

We therefore wanted to move to an unsupervised method to avoid those 
drawbacks. Some classical segmentation algorithms, such as edge detection and 
splitting and merging, were tested first, but the results they provided were 
unsatisfactory. 

As a clustering of the entire image would lead to useless results, we then 
decided to automate the training set selection procedure. This automation is 
discussed hereafter. We first detail how homogeneous areas are detected, and 
then how they are merged into training sets. The last sections are devoted 
to applications on satellite and airborne images. 



2. Detection of homogeneous areas in the image 

Because of mixels (pixels at the intersection of several regions in an image), we 
have to reduce the available information to the useful information. The 
idea is to detect homogeneous areas in the image, taking into account 
not only the spectral intensities of the pixels but also the texture of the regions. 

To detect changepoints, the dual of finding homogeneous areas, Stephens and 
Smith (1993) propose a parametric approach: on each row and column of the 
image, a hypothesis test is used to decide whether or not there is a changepoint. 

At the same time, we considered another approach. Also using parametric 
models, Wichern, Miller and Hsu (1976) propose a method to find sets of 
contiguous pixels that almost surely contain no changepoint (up to an α-level of 
error). 

Based on these two points of view, we examine whether a pixel is a changepoint, on 
a row for example, by testing the equality of the densities of the samples of 
size n on the left-hand side and on the right-hand side of this pixel. A pixel is a 
changepoint if it is a changepoint either on a row or on a column. 

Because the results obtained were poor, we modified our strategy: the test is 
no longer applied to detect a change of parameters but to detect a change of 
distribution, whatever the direction may be. To this end, classical 
nonparametric tests such as the Kolmogorov-Smirnov test are used. 

However, when this method is applied to multispectral images, each coordinate of the 
distribution is treated separately, so the complete distribution is not taken into 
account. This may explain the weak results obtained. 

We finally found an approach able to consider the full space in the paper of 
Devroye and Wise (1980). This approach, justified by asymptotic theorems, 
consists of comparing the supports of the densities. 

Our procedure is therefore as follows: 

• first we estimate the supports on ranges of the same size on the left-hand 
side and on the right-hand side of each pixel (for the horizontal step), 
and we do the same vertically and along the two diagonals; 

• then we count the number of points on one side that lie in the 
support of the points of the other side, and conversely; if the total 
number is sufficiently large in the four directions, homogeneity is 
not rejected. 

The density support is estimated by a union of discs centered on the 
pixels in the spectral space. The choice of the radius is obviously a 
difficulty in this procedure. 
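
The support comparison for a single direction can be sketched as follows. This is an illustrative reading of the two steps above, not the authors' implementation: the function names, the radius, the acceptance threshold `min_coverage`, and the pixel values are all assumptions. Each sample is a list of pixels in the 3-channel spectral space, and the support of a sample is taken to be the union of balls of a fixed radius centered on its points.

```python
import math

def in_support(point, sample, radius):
    """True if `point` lies in the union of balls of `radius` centered on `sample`."""
    return any(math.dist(point, s) <= radius for s in sample)

def mutual_coverage(left, right, radius):
    """Proportion of points of each side that fall in the support of the other side."""
    hits = sum(in_support(p, right, radius) for p in left)
    hits += sum(in_support(p, left, radius) for p in right)
    return hits / (len(left) + len(right))

def homogeneous(left, right, radius, min_coverage=0.8):
    """Homogeneity is not rejected when the mutual coverage is large enough."""
    return mutual_coverage(left, right, radius) >= min_coverage

# Hypothetical pixel values (three spectral channels) on either side of a pixel.
left = [(50, 20, 18), (52, 21, 19), (49, 22, 17), (51, 19, 18)]
right_same = [(51, 21, 18), (50, 20, 19), (52, 22, 18), (49, 20, 17)]
right_other = [(160, 30, 34), (158, 31, 36), (161, 29, 35), (159, 32, 33)]

print(homogeneous(left, right_same, radius=3.0))
print(homogeneous(left, right_other, radius=3.0))
```

In the procedure described above the same comparison would be repeated vertically and along both diagonals, and a pixel would be kept as homogeneous only if the test passes in all four directions.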



3. Combining the homogeneous areas 

Having a large number of homogeneous parcels at our disposal, we merge them 
into a small number of classes. To do so, we compute a similarity measure 
between zones using the same strategy as for testing homogeneity: 
for each pair of zones, we calculate the proportion of points lying in the 
support of the other zone. Our similarity is therefore a measure between 0 and 
1. 

With this similarity matrix, we can now group the areas by a hierarchical 
classification method such as single linkage or complete linkage. We 
stop this process when the similarity between the groups falls below a fixed 
level. At this point, we keep only the groups formed by the merging of several 
initial zones. We now have homogeneous groups at our disposal which are 
representative of different populations of the image. These groups will be used 
as training sets. 
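
The merging step can be sketched for the single linkage case. Cutting a single linkage hierarchy at a given similarity level yields the connected components of the graph that links every pair of zones whose similarity is at least that level, so a union-find structure is enough. The similarity matrix below is hypothetical and assumed symmetric, and the stopping level is an arbitrary choice for the example.

```python
def single_linkage_groups(similarity, level):
    """Groups obtained by single linkage, stopped at similarity `level`.

    Only groups formed by merging several initial zones are returned.
    """
    n = len(similarity)
    parent = list(range(n))

    def find(i):
        # Root of the group containing zone i, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= level:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) > 1)

# Hypothetical similarities between six homogeneous zones.
sim = [
    [1.00, 0.90, 0.10, 0.00, 0.20, 0.05],
    [0.90, 1.00, 0.20, 0.10, 0.75, 0.00],
    [0.10, 0.20, 1.00, 0.85, 0.00, 0.10],
    [0.00, 0.10, 0.85, 1.00, 0.10, 0.00],
    [0.20, 0.75, 0.00, 0.10, 1.00, 0.05],
    [0.05, 0.00, 0.10, 0.00, 0.05, 1.00],
]

training_groups = single_linkage_groups(sim, level=0.7)
print(training_groups)
```

Zones 0 and 4 end up in the same group through zone 1 even though their direct similarity is low, which is the chaining behaviour typical of single linkage; zone 5 is never merged and is therefore discarded.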



4. Applications 

We apply this scheme to satellite and airborne images whose 
spectral space is composed of 3 channels. The satellite image is a scene of 
Marigny (France) taken by the SPOT satellite in early April. 
The 3 SPOT channels were used (XS1, XS2, XS3). The airborne image is a 
picture of the region of Arlon (Belgium) and is composed of the classical 3 
channels (Red, Green, Blue). The images were provided by IGN France and by 
the Walphot company (Belgium), respectively. 

For each image, we automatically detect the training sets using single 
linkage as the clustering method. Afterwards, we classify the image using the 
behavioral method elaborated in our laboratory (Rasson and Granville 1996). 
This procedure is based on the nonhomogeneous Poisson process. For this, we 
estimate the density in each group by the kernel method. 

On the Marigny image, we succeed in detecting the main classes. Nevertheless, 
the urban class is not recognized because of the small number of pixels in this 
group. The most interesting point is that several areas of the image are left 
unclassified by our method, which indicates that classes may be missing. 
To reduce an excessive number of unclassified pixels, the same 
scheme can be applied again to the unclassified areas only. In this way, some 
missing classes may be recovered. 

On the Arlon image, we detect more classes than the expert. In fact, one class 
of the expert representing fields is split into two classes of fields characterized by 
differences in drainage. Thanks to this procedure we are also able to 
discriminate between roofs and walls of houses. This information is of interest to 
geographers and can be included in GIS (Geographic Information System) 
technology. 

5. Numerical comparison 

We compare some automatic training sets with the experts' ones for the 
satellite picture. For this, we compare the mean and standard deviation for the 
classes which appear in both sets (Table 1). 

Table 1: Comparison of mean and standard deviation of channels XS1, XS2 
and XS3 for training sets given by experts and automatic ones. 

                   Experts' training sets                     Automatic training sets
Class              Size          Mean          Std. Dev.      Size   Mean          Std. Dev.
Deciduous          (illegible)   77, 24, 22    20, 13, 12     1883   69, 21, 19    13, 5, 5
Grassland          (illegible)   170, 25, 30   35, 18, 15     131    203, 18, 30   6, 3, 3
Fields             3800          94, 126, 111  27, 44, 41     884    78, 136, 108  7, 9, 9
Winter Cultures    5600          159, 32, 35   29, 22, 21     552    158, 24, 27   10, 5, 7
From Table 1, we see that the means differ noticeably between the two kinds of training 
sets. But the most interesting fact is that the standard deviations of 
the experts' sets are always much greater than those of the automatic sets. So 
our algorithm can discover training sets that are much more homogeneous than those 
given by experts. 

Moreover, with the experts' sets, it is difficult to distinguish between 
grassland and winter cultures. In the automatic sets, there is a clear separation between 
the two classes, as can be seen in Figure 1. 

Figure 1: Comparison of distribution for grassland and winter culture 
training sets in the XS1-XS3 plane. Figure 1.a: Grassland set given by 
experts. Figure 1.b: Winter cultures set given by experts. Figure 1.c: 
Automatic grassland set. Figure 1.d: Automatic winter cultures set. 




6. Conclusion 

Using statistical tests, we are able to detect homogeneous areas inside an image 
up to a level of error. By tuning parameters, clustering methods provide 
automatic training sets which have several qualities: they are more 
homogeneous than those given by experts; they are less subjective; they can 
reveal classes that are difficult for experts to detect. 

Segmentation of images with such training sets gives information to 
geographers and can help them to solve environmental problems and to 
establish land planning. The results may be used not only to follow the 
evolution of the land cover, but also to cross the land cover inventory with 
other information useful for land planning through GIS (Geographic 
Information System) technology. 
Abstract: An algorithm that identifies the variable subsets with the most 
discriminatory power (in a predictive sense) is proposed. This algorithm minimizes 
parametric estimates of the error rate among all the possible variable subsets, 
evaluating only a fraction of the total number of subsets. The computational 
feasibility is illustrated by simulation experiments. 

Keywords: Discriminant Analysis; Variable Selection Techniques; Optimization 
Techniques. 



1. Introduction 

In many applied discriminant analysis studies data is collected on a large number of 
variables, which are later to be reduced to a small subset. In the early stages of the 
analysis the researcher may not want to disregard any potentially interesting 
variable. However, discriminant functions with too many variables are usually 
difficult to estimate and interpret, and it is well known that the inclusion of 
irrelevant or redundant variables in data-based classification rules tends to weaken 
their prediction power (Jain and Waller 1978, McLachlan 1992, pp 391-392). 

Most major statistical software packages implement stepwise selection methods 
that sequentially try to improve global measures of group discrimination. These 
methods tend to rely on statistical tests to measure the potential contribution of 
additional variables to the “global separation” between the groups (McKay and 
Campbell 1982a, Huberty and Wisenbaker 1992, Huberty 1994 pp 231-233, 
McLachlan 1992, pp 392-400). In spite of their popularity these procedures have 
important shortcomings: 

(i) The criterion used to select variables should take into consideration the 
nature of the problem and the objectives of the analysis. In particular, when 
the primary objective of the analysis is prediction, the variable selection 
process should attempt to maximize prediction power rather than group 
separation or discrimination (McKay and Campbell 1982b). 

(ii) Stepwise selection methods are suboptimal strategies that were developed to 
overcome the computational difficulties associated with “all possible subsets” 
selection methods. However, when computationally feasible, “all subsets” 
strategies are preferable. As stepwise methods analyze one variable at a 
time, they tend to ignore the combined effect of several variables. 



2. Choice of Selection Criterion 



In descriptive DA studies variables are usually chosen in order to maximize some 
index of group separation. The most common index used for this purpose is η² = 1 
- Λ (Λ being the well-known Wilks' statistic). 

In predictive DA studies variables should be chosen in order to optimize some 
estimate of prediction ability with good statistical properties. 

In two-group problems there are several parametric estimators of the conditional 
error rate that, under the usual assumptions of multivariate normality and equality 
of variance-covariance matrices across groups, tend to be superior to their 
nonparametric counterparts (McLachlan 1992, pp 369-370). For instance, 
Lachenbruch (1968) and McLachlan (1974) have proposed, respectively, $e_i^{(L)}$ 
and $e_i^{(M)}$ as estimators for the conditional error rate in group i. 
where Φ(·) and φ(·) are the cumulative distribution function and the probability density function of a 
standard normal variate, D is the sample Mahalanobis distance between group 
centroids, ni is the number of training sample observations in group i, N = n1 + n2, m 
= N - 2, and p is the number of variables used in the allocation rule. 

Estimators of the global conditional error rate can be defined as $e^{(L)} = \pi_1 e_1^{(L)} + \pi_2 e_2^{(L)}$ 
or $e^{(M)} = \pi_1 e_1^{(M)} + \pi_2 e_2^{(M)}$, where π1 and π2 are prior probabilities of group 
membership. 

Once n1, n2 and p are fixed, $e_i^{(L)}$ and $e_i^{(M)}$ are simple functions of D. Thus, for a 
given subset size, minimizing $e^{(L)}$ or $e^{(M)}$ is equivalent to maximizing D. 
Furthermore, it is well known that D and η² are related by (3). 

$$
D^2 = \frac{N\,m}{n_1 n_2}\,\frac{\eta^2}{1-\eta^2} \tag{3}
$$
Therefore, for two-group problems, the use of η² (or equivalently D) as variable 
selection criterion can also be justified from a predictive perspective. Furthermore, 
although D and η² never decrease with the addition of new variables, $e^{(L)}$ and $e^{(M)}$ 
provide two alternative ways for comparing subsets of different dimensions. 



3. Stepwise and All-Subsets Variable Selection Algorithms 

The most important methods for variable selection in DA can be broadly divided into 
two major classes: Stepwise and All-Subsets methods. Stepwise methods can be further 
divided into forward (methods that start with an empty subset) and backward 
(methods that start with the full set of potential variables). Both these approaches 
have some limitations. The most important problem with forward stepwise 
methods is that they may fail to recognize the combined discriminatory power 
provided by groups of variables which individually look uninteresting. Backward 
stepwise approaches are not so seriously affected by this problem. However, since 
backward strategies typically start with a much larger number of variables than 
necessary, they usually introduce a considerable amount of noise that may seriously 
affect the selection process. 

The preferred strategy for variable selection is to look at all the possible variable 
subsets and choose the one that looks most “interesting”. However, in a problem 
with P potential variables, the number of different subsets equals 2^P - 1, which can 
easily be a formidable number. This problem also occurs in regression analysis, 
where a considerable research effort has been made in the development of efficient 
algorithms for All-Subsets variable selection. For instance, Furnival (1971) has 
shown that, by performing successive Gaussian elimination operations on the Sum of 
Squares and Cross Products (SSCP) matrix (4), in an appropriate order, the 
Residual Sum of Squares (RSS) for the 



$$
\mathrm{SSCP} = \begin{bmatrix} X'X & X'Y \\ Y'X & Y'Y \end{bmatrix} \tag{4}
$$



regressions of Y on all the subsets of X can be evaluated with an effort of about six 
floating point operations per regression. McCabe (1975) has adapted Furnival's 
algorithm to the DA problem, showing that the value of Wilks' Λ, for all possible 
variable subsets, can be evaluated with an effort of about twelve floating point 
operations per subset. Furnival and Wilson (1974) have refined Furnival's 
algorithm, showing that the best variable subsets for each subset size can be found 
without computing all the 2^P - 1 RSS. The Furnival and Wilson algorithm combines the 
original Furnival strategy that moves from SSCP, working in a forward fashion 
(adding variables to previous subsets), with a backward strategy that works from 
SSCP^-1, deleting variables at each step. Working simultaneously their way up from 
SSCP and their way down from SSCP^-1, Furnival and Wilson build a search tree in 
a similar manner to branch and bound algorithms of integer programming. Using 
RSS from previous subsets as bounds, and the fact that the deletion of variables from 
a given subset can never reduce its RSS, large parts of the search tree can be 
pruned. The details of the algorithm can be found in Furnival and Wilson (1974) 
and will not be repeated here. 

The Furnival and Wilson algorithm can be readily adapted to the search for the best 
subsets (in the sense of D) in two-group DA. For that purpose, the SSCP matrix 
(4) should be replaced by matrix M (5), 

$$
M = \begin{bmatrix} S & \bar{x}_1 - \bar{x}_2 \\ (\bar{x}_1 - \bar{x}_2)' & 0 \end{bmatrix} \tag{5}
$$

where S represents the sample variance-covariance matrix, and $\bar{x}_1$ and $\bar{x}_2$ the 
sample mean vectors for groups 1 and 2. 

Matrix M is associated with an empty subset and matrix M^-1 with the set that 
includes all variables. It can be shown that the lower-right corner element of 
M^-1 equals -D². Using Gaussian elimination operations to move, “by all possible 
paths”, from M in the direction of the identity matrix, the value of -D² is 
constantly updated from one subset to another with one additional variable. In a 
similar manner, moving from M^-1 in the direction of the identity matrix, the value 
of -D² is constantly updated from one subset to another with one less variable. 
Choosing $e^{(L)}$ or $e^{(M)}$ as evaluation criterion, it is possible to compare subsets of 
different dimensions, leading to sharper bounds and further pruning of the search 
tree. 
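
The way -D² is read off the lower-right corner can be sketched with a plain Gauss-Jordan pivot (sweep). The code below is a brute-force illustration only: it enumerates every subset and does not implement the leaps-and-bounds pruning of Furnival and Wilson. The covariance matrix S, the mean difference d, and all function names are hypothetical choices for the example.

```python
from itertools import combinations

def sweep(matrix, k):
    """Return a copy of `matrix` after a Gauss-Jordan pivot on diagonal element k."""
    n = len(matrix)
    p = matrix[k][k]
    out = [row[:] for row in matrix]
    for i in range(n):
        for j in range(n):
            if i != k and j != k:
                out[i][j] = matrix[i][j] - matrix[i][k] * matrix[k][j] / p
    for j in range(n):
        if j != k:
            out[k][j] = matrix[k][j] / p
            out[j][k] = -matrix[j][k] / p
    out[k][k] = 1.0 / p
    return out

def build_m(S, d):
    """Bordered matrix [[S, d], [d', 0]] associated with the empty subset."""
    rows = [S[i][:] + [d[i]] for i in range(len(d))]
    rows.append(d[:] + [0.0])
    return rows

def d_squared(M, subset):
    """Squared Mahalanobis distance for `subset`: pivot on its variables, read the corner."""
    for k in subset:
        M = sweep(M, k)
    return -M[-1][-1]

def best_subsets(S, d):
    """Best subset of each size by exhaustive enumeration (no pruning)."""
    M = build_m(S, d)
    P = len(d)
    return {
        size: max((d_squared(M, sub), sub) for sub in combinations(range(P), size))
        for size in range(1, P + 1)
    }

# Hypothetical pooled covariance matrix and difference of group mean vectors.
S = [
    [1.0, 0.4, 0.2, 0.1],
    [0.4, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.4],
    [0.1, 0.2, 0.4, 1.0],
]
d = [1.0, 0.8, 0.5, 0.2]

best = best_subsets(S, d)
for size, (value, subset) in best.items():
    print(size, subset, round(value, 4))
```

Each pivot adds one variable to the current subset, so a search that moves from subset to subset along a tree only needs one sweep per step; the exhaustive loop above redoes the pivots for every subset and is therefore only practical for small P.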



4. Simulation Experiments 

The algorithm described in the previous section, was implemented in the 
programming language. The evaluation criterion chosen was e^^^ In order to 
compare its computational effort for different values of P, the following simulation 
experiment was performed. 

For a number of potential variables (P), ranging from 10 to 30, and population 
Mahalanobis distances (Δ) of 1, 2 and 3, balanced training samples of 50 
observations per group were generated from two multivariate normal populations 
N(0, Σ) and N(μ2, Σ). The common variance-covariance matrix, Σ, was built 
assuming that all correlations between X1, X2, ..., XP could be explained by a three-factor 
model with total communalities of 0.80 for all variables. The signs of the 
correlations between the variables and the factors and the relative importance of 
each factor in the total communalities were assigned at random. The mean vector 
in group 2 (μ2) was controlled in order to achieve the desired level of Δ, assuming 
equal discriminatory power for all variables. Equal priors were assumed throughout 
the whole experiment. 

This design tries to replicate a reasonably realistic, non-trivial correlation structure, 
while keeping everything else as simple as possible. The choice of variables with 
equal discriminatory power was made in order to illustrate a worst-case scenario 
(in terms of computational effort). 

In each replication of the experiment, after searching for the best 20 variable 
subsets according to e^^\ statistics were collected on: a) The proportion of 

subsets that had to be evaluated (P.E.S.); b) The number of floating point 
operations required for the successive computations of (N.F.P.O) 

For each combination of P and Δ, the experiment was replicated five times. The 
minimum, median and maximum values of these statistics, for P=10, P=20 and 
P=30, are shown in Table 1. 



Table 1: Computational Statistics for selected values of P 

              P.E.S.                          N.F.P.O.
 P    Δ      min      med      max           min          med          max
10   1.0   29.03%   31.18%   41.35%        3,619        3,896        4,494
10   2.0   22.97%   31.57%   38.22%        3,536        3,945        4,398
10   3.0   21.02%   25.51%   39.20%        3,374        3,676        4,484
20   1.0    0.52%    0.82%    1.29%      117,143      223,077      373,393
20   2.0    0.78%    1.21%    1.69%      241,200      297,740      444,723
20   3.0    0.88%    1.59%    2.35%      265,433      352,001      557,660
30   1.0    0.01%    0.02%    0.05%    7,462,258   11,794,926   29,135,755
30   2.0    0.01%    0.03%    0.10%    7,921,745   18,956,392   45,606,974
30   3.0    0.02%    0.05%    0.14%    8,636,676   26,129,459   57,380,001

This limited evidence shows that the proportion of subsets that have to be 
evaluated has a strong tendency to decrease as P grows. Therefore, although the 
number of floating point operations still increases exponentially with P, it no longer 
doubles with each additional variable. 
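The effect of a shrinking P.E.S. can be made concrete with a little arithmetic. The sketch below assumes that an all-subsets search over P variables ranges over the 2^P - 1 non-empty subsets (an assumption; the exact count used by the author is not restated in this section) and converts the median P.E.S. values reported for Δ = 1.0 into approximate numbers of evaluated subsets. The function name is hypothetical.

```python
def subsets_evaluated(p, pes_percent):
    """Return (total non-empty subsets, approximate number actually evaluated).

    Assumes the search space is the 2**p - 1 non-empty subsets of p variables.
    """
    total = 2 ** p - 1
    return total, total * pes_percent / 100.0


if __name__ == "__main__":
    # Median P.E.S. values for Delta = 1.0, taken from Table 1.
    for p, pes in [(10, 31.18), (20, 0.82), (30, 0.02)]:
        total, evaluated = subsets_evaluated(p, pes)
        print(f"P={p}: {total} subsets in total, about {evaluated:.0f} evaluated")
```

Under this assumption the total number of subsets grows by a factor of about 1024 for every ten extra variables, while the number of evaluated subsets grows far more slowly.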

All the problems considered in this experiment were analyzed with a 486DX2 
Personal Computer with 16M of RAM, and no problem required more than 20 
minutes of CPU time. 



5. Conclusions 

The variable selection process in predictive Discriminant Analysis studies should be 
based on estimators of prediction power. For two-group problems, when the 
classical assumptions hold, parametric estimators of error rates are known to have 
good statistical properties and can be used as criteria for variable selection. With 
these criteria, all-subsets analysis can and should be conducted as long as the 
number of potential variables is not “very large”. When the classical assumptions 
do not hold, non-parametric estimators of the error rate might have better 
properties. However, the computational effort required for an all-subsets analysis 
based on non-parametric estimators may be prohibitive. In that case, the choice of 
selection criterion should balance the statistical advantages of non-parametric 
estimators against the dangers of stepwise analysis. 



References 

Furnival, G.M. 1971. All Possible Regressions with Less Computation. 
Technometrics, 13: 403-408. 

Furnival, G.M. & Wilson, R.W. 1974. Regressions by Leaps and Bounds. 
Technometrics, 16: 499-511. 

Huberty, C.J. 1994. Applied Discriminant Analysis, New York, NY: Wiley. 

Huberty, C.J. & Wisenbaker, J.M. 1992. Variable Importance in Multivariate 
Group Comparisons. Journal of Educational Statistics, 17: 75-91. 

Jain, A.K. & Waller, W.G. 1978. On the Optimal Number of Features in the 
Classification of Multivariate Gaussian Data. Pattern Recognition, 10: 365- 
374. 

Lachenbruch, P.A. 1968. On Expected Probabilities of Misclassification in 
Discriminant Analysis, Necessary Sample Size, and a Relation with the 
Multiple Correlation Coefficient. Biometrics, 24: 823-834. 

McCabe, G.P. 1975. Computations for Variable Selection in Discriminant 
Analysis. Technometrics, 17: 103-109. 

McKay, R.J. & Campbell, N.A. 1982a. Variable Selection Techniques in 
Discriminant Analysis I. Description. British Journal of Mathematical and 
Statistical Psychology, 35: 1-29. 

McKay, R.J. & Campbell, N.A. 1982b. Variable Selection Techniques in 
Discriminant Analysis II. Allocation. British Journal of Mathematical and 
Statistical Psychology, 35: 30-41. 

McLachlan, G. J. 1974. An Asymptotic Unbiased Technique for Estimating the 
Error Rates in Discriminant Analysis. Biometrics, 30: 239-249. 

McLachlan, G. J. 1992. Discriminant Analysis and Statistical Pattern 
Recognition, New York, NY: Wiley. 

Multinomial Samples 
with Applications to Textual Data 

Abstract: We develop the canonical discriminant analysis of G-group data 
consisting of $n_1, \dots, n_G$ multinomial samples within each group, on a scaled 
Euclidean space with chi-square distance. Our discriminant analysis produces 
quantification plots showing both the q observation categories and the 
N (= $n_1 + \cdots + n_G$) multinomial sample units, as well as the G group 
centroids. We apply the proposed method to Korean text analysis to extract 
statistical characteristics of the Korean language by genre. 

Keywords: Canonical Discriminant Analysis, Multinomial Samples. 



1. Aim and Background 

For G-group data of continuous q-variate observations, canonical 
discriminant analysis (CDA) is well developed and used frequently in practice: 
see Johnson and Wichern (1992) for instance. But when all the observation 
units are multinomial samples with q observation categories, we need to 
develop a new version of CDA that respects the categorical nature of the measured 
data. That situation arises, for example, when we study the statistical 
characteristics of texts by writer or by genre. 

In the past, such data have been analyzed either by correspondence analysis 
or by Hayashi's quantification method III. Both respect the categorical nature of 
the data but ignore the group membership of the sample units in achieving 
dimensional reduction. For instance, see Jin (1994), who studied the 
positioning of commas in Japanese literary texts written by three famous 
novelists. 



2. Geometry and Algebra of Multinomial Samples in $R^q$ 

Denote the j-th multinomial sample in group i by 

    $f_{ij} = (f_{ij1}, \dots, f_{ijq})'$,  $i = 1, \dots, G$;  $j = 1, \dots, n_i$;  $\sum_{i=1}^{G} n_i = N$, 

where $f_{ijk}$ is the observed frequency of category k contained in that 
sample ($k = 1, \dots, q$). Then, by stacking $f_{ij}'$, $i = 1, \dots, G$; $j = 1, \dots, n_i$, in 
lexicographic order of i and j, we have the $N \times q$ frequency data matrix F. 
Transform the $q \times 1$ frequency vectors to probability vectors $p_{ij}$ by 

    $p_{ij} = (f_{ij1}, \dots, f_{ijq})' / f_{ij+}$,  $f_{ij+} = \sum_{k=1}^{q} f_{ijk}$;  $1_q' p_{ij} = 1$. 

So the $p_{ij}$ ($i = 1, \dots, G$; $j = 1, \dots, n_i$) are N points on the (q-1)-dimensional 
simplex in $R^q$. Then, as in Greenacre and Hastie (1984), we may impose the 
geometry on the Euclidean space containing the N point elements $p_{ij}$ with 
weights proportional to $f_{ij+}$, i.e. 

    $w_{ij} = f_{ij+} / \sum_{i=1}^{G} \sum_{j=1}^{n_i} f_{ij+}$, 

and the inter-point distances in the chi-square sense, defined by 

    $d^2(p_{ij}, p_{lm}) = (p_{ij} - p_{lm})' D_c^{-1} (p_{ij} - p_{lm})$, 

where c is the aggregated multinomial probability vector, i.e. 

    $c = \sum_{i=1}^{G} \sum_{j=1}^{n_i} f_{ij} / \sum_{i=1}^{G} \sum_{j=1}^{n_i} f_{ij+} = \sum_{i=1}^{G} \sum_{j=1}^{n_i} w_{ij} p_{ij}$, 

and $D_c$ is the diagonal matrix with c as diagonal. 

To locate the q-dimensional points $p_{ij}$ in a smaller dimensional subspace, 
consider the projection on a $q \times 1$ vector $a \neq 0$. That is, 

    $(p_{ij}' D_c^{-1} a / a' D_c^{-1} a)\, a = x_{ij}\, a$,  $i = 1, \dots, G$;  $j = 1, \dots, n_i$. 

Thus, the objective is maximizing the ratio of the weighted sum of squares between 
groups (SSB) to the weighted sum of squares within groups (SSW): 

    $\max_a \; \sum_{i=1}^{G} \sum_{j=1}^{n_i} w_{ij} (\bar{x}_i - \bar{x})^2 \Big/ \sum_{i=1}^{G} \sum_{j=1}^{n_i} w_{ij} (x_{ij} - \bar{x}_i)^2$. 

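It is not difficult to verify that 

    SSB $= \sum_{i=1}^{G} \sum_{j=1}^{n_i} w_{ij} \{ (\bar{p}_i - \bar{p})' D_c^{-1} a \}^2 / (a' D_c^{-1} a)^2$ 
        $= a' D_c^{-1} (\bar{P}_G - \bar{P})' D_w (\bar{P}_G - \bar{P}) D_c^{-1} a / (a' D_c^{-1} a)^2$, 

    SSW $= \sum_{i=1}^{G} \sum_{j=1}^{n_i} w_{ij} \{ (p_{ij} - \bar{p}_i)' D_c^{-1} a \}^2 / (a' D_c^{-1} a)^2$ 
        $= a' D_c^{-1} (P - \bar{P}_G)' D_w (P - \bar{P}_G) D_c^{-1} a / (a' D_c^{-1} a)^2$, 

where P is the $N \times q$ matrix with rows $p_{ij}'$, $\bar{P}_G$ and $\bar{P}$ are the $N \times q$ 
matrices whose rows are the corresponding group centroids $\bar{p}_i'$ and the overall 
centroid $\bar{p}'$, and $D_w$ is the diagonal matrix of the weights $w_{ij}$. 

Therefore, the objective can be formulated as 

    max $a' B a$ subject to $a' W a$ = constant,    (1) 

where 

    $B = D_c^{-1} (\bar{P}_G - \bar{P})' D_w (\bar{P}_G - \bar{P}) D_c^{-1}$, 
    $W = D_c^{-1} (P - \bar{P}_G)' D_w (P - \bar{P}_G) D_c^{-1}$. 

Using the Lagrange multiplier method, (1) leads to solving 

    $B a = \lambda W a$. 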

Since rank(W) = r < q, decompose W into $V_1 \Lambda_1 V_1'$, for $V_1$: a $q \times r$ matrix 
with orthonormal columns and $\Lambda_1$: an $r \times r$ diagonal matrix. Then 

    $V_1' B a = \lambda V_1' V_1 \Lambda_1 V_1' a = \lambda \Lambda_1 V_1' a$. 

By letting $u = \Lambda_1^{1/2} V_1' a$ and imposing the condition $0 = V_2' a$, where 
$V_2$ is a $q \times (q - r)$ matrix with complementary orthonormal columns, we have 

    $a = V_1 \Lambda_1^{-1/2} u$.    (2) 

Hence 

    $\Lambda_1^{-1/2} V_1' B V_1 \Lambda_1^{-1/2} u = \lambda u$. 

Therefore, $\lambda$ and u are eigenvalues and eigenvectors, respectively, of 

    $\Lambda_1^{-1/2} V_1' B V_1 \Lambda_1^{-1/2}$. 

And a can be obtained via (2). 
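The computation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the group centroids are taken to be weighted means of the sample profiles, the function names `between_within` and `solve_cda` are hypothetical, and every observation category is assumed to have a positive total count so that $D_c$ is invertible. The frequencies are those of Table 1 in Section 4.

```python
import numpy as np


def between_within(F, groups):
    """Build B and W from an N x q frequency matrix F and N group labels."""
    F = np.asarray(F, dtype=float)
    groups = np.asarray(groups)
    row_totals = F.sum(axis=1)
    P = F / row_totals[:, None]          # rows are the probability vectors p_ij'
    w = row_totals / row_totals.sum()    # weights w_ij
    c = w @ P                            # aggregated probability vector (overall centroid)
    P_G = np.empty_like(P)               # each row replaced by its group centroid
    for g in np.unique(groups):
        idx = groups == g
        P_G[idx] = (w[idx] @ P[idx]) / w[idx].sum()
    Dc_inv = np.diag(1.0 / c)
    Dw = np.diag(w)
    B = Dc_inv @ (P_G - c).T @ Dw @ (P_G - c) @ Dc_inv
    W = Dc_inv @ (P - P_G).T @ Dw @ (P - P_G) @ Dc_inv
    return B, W


def solve_cda(B, W, rel_tol=1e-10):
    """Solve B a = lambda W a for rank-deficient W via W = V1 Lambda1 V1'."""
    evals, evecs = np.linalg.eigh(W)
    keep = evals > rel_tol * evals.max()   # drop the numerically null directions
    V1, lam1 = evecs[:, keep], evals[keep]
    S = V1 / np.sqrt(lam1)                 # V1 Lambda1^{-1/2}
    lam, U = np.linalg.eigh(S.T @ B @ S)
    order = np.argsort(lam)[::-1]          # largest eigenvalue first
    return lam[order], S @ U[:, order]     # columns of the result are a = V1 Lambda1^{-1/2} u


# Frequencies from Table 1 (rows: A1-A3, P1-P3, N1-N3; columns: the 7 parts of speech).
F = np.array([
    [136, 160, 530, 493, 776, 130, 339],
    [181, 122, 546, 393, 664, 104, 390],
    [120, 271, 491, 458, 841, 160, 330],
    [91, 247, 571, 490, 840, 186, 362],
    [112, 237, 564, 436, 725, 171, 387],
    [120, 286, 477, 492, 853, 232, 293],
    [54, 174, 242, 297, 853, 109, 117],
    [66, 206, 406, 446, 908, 191, 211],
    [51, 270, 314, 351, 904, 128, 162],
])
groups = np.repeat(np.array(["A", "P", "N"]), 3)
B, W = between_within(F, groups)
lam, A = solve_cda(B, W)

if __name__ == "__main__":
    print("two leading eigenvalues:", lam[:2])
```

Since B has rank at most G - 1, only the first two eigenvalues can be non-zero here. The leading values need not match the figures reported in Section 4 exactly, because the centroid and weight definitions used in this sketch are reconstructed rather than taken verbatim from the paper.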



3. Biplot Representation 

We may apply the biplot representation, originally due to Gabriel (1971), in 
the following way ($i = 1, \dots, G$; $j = 1, \dots, n_i$; $k = 1, \dots, q$). 

Sample Units: $(p_{ij} - \bar{p})' D_c^{-1} a / a' D_c^{-1} a$. 

Group Centroids: $(\bar{p}_i - \bar{p})' D_c^{-1} a / a' D_c^{-1} a$. 

Obs. Categories: $(e_k - \bar{p})' D_c^{-1} a / a' D_c^{-1} a$, 

where $e_k = (0, \dots, 1, \dots, 0)'$, of which only the k-th position is filled by 1. 

The reduced dimensional quantification plots will be helpful in the visual 
cognition of the multinomial samples, in reference to observation categories. 
Different scales for dimensional axes can be useful for proportionate 
expression, by re-scaling a so that $a' T a = 1$, where 

    $T = B + W = D_c^{-1} (P - \bar{P})' D_w (P - \bar{P}) D_c^{-1}$. 



4. An Example : Korean Text Data Analysis 

We have collected textual data on 36 genres, with around ten texts per genre. 
A subset of the data, given in Table 1, is used for methodological demonstration: 
three genres (G = 3) of Korean texts (autobiography A, popular magazine P, and 
newspaper article N), with three sample texts from each ($n_1 = n_2 = n_3 = 3$, N = 9). 
The measured variables are tallies of seven parts of speech, such as adverb, noun, 
suffix and verb (q = 7). The two-dimensional quantification plots are shown in 
Figures 1 and 2; the first eigenvalue is 58.2 and the second is 2.9. 



Table 1: Korean Texts Classified by Genres and Parts of Speech 

Genre               Sample     Adverb  "Asc"   "Eomi"    "Josa"  Noun  Suffix  Verb 
                                       (Mark)  (Ending)  (Aux.) 
Autobiography       Sample 1     136    160      530      493    776    130    339 
                    Sample 2     181    122      546      393    664    104    390 
                    Sample 3     120    271      491      458    841    160    330 
Popular Magazine    Sample 1      91    247      571      490    840    186    362 
                    Sample 2     112    237      564      436    725    171    387 
                    Sample 3     120    286      477      492    853    232    293 
Newspaper Articles  Sample 1      54    174      242      297    853    109    117 
                    Sample 2      66    206      406      446    908    191    211 
                    Sample 3      51    270      314      351    904    128    162 



From Figure 1, we see that Genre P (Popular magazine) is positioned between 
Genre N (Newspaper article) and Genre A (Autobiography) on the first axis. By 
superimposing Figure 2 over Figure 1, we also see that Genre A can be typified 
by the frequent use of verbs and Genre N by “Eomi” (endings). 

Figure 1: Sample Units 

Figure 2: Observation Categories 






How to extract predictive binary attributes from a categorical one 



I. C. Lerman 
IRISA-INRIA 
Rennes, France 
lerman@irisa.fr 



J. F. Pinto da Costa 
Dep. Matematica Aplicada & LIACC 
Universidade do Porto, Portugal 
jpcosta@ncc.up.pt 



Abstract: In this work, we present new ways of dealing with categorical attributes; 
in particular, the methodology introduced here concerns the use of these attributes in 
binary decision trees. We consider essentially two main operations. The first one 
consists in using the joint distribution of two or more categorical attributes in order 
to increase the final performance of the decision tree. The second, and most 
important, operation concerns the extraction of relatively few predictive binary 
attributes from a categorical attribute, especially when the latter has a large number 
of values. With more than two classes to predict, most of the present binary decision 
tree software needs to test an exponential number of binary attributes for each 
categorical attribute, which can be prohibitive. Our method, ARCADE, is 
independent of the number of classes to be predicted, and it starts by significantly 
reducing the number of values of the initial categorical attribute. This is done by 
clustering the initial values, using a hierarchical classification method. Each cluster 
of values will then represent a new value of a new categorical attribute. This new 
attribute will then be used in the decision tree, instead of the initial one. 
Nevertheless, not all of the binary attributes associated with this new categorical 
attribute will be used; only those which are predictive. The reduction in the 
complexity of the search for the best binary split is therefore enormous, as will be 
seen in the application that we consider, namely the old and still lively protein 
secondary structure prediction problem. 

Keywords: Decision trees, categorical attributes, binarization, hierarchical clustering 



1. Introduction 

In our work, the construction of binary decision trees is based on a learning set Ω_l 
and a test set Ω_t. Ω_l is used to construct a very large tree, to find a sequence of 
pruned trees, and also to choose a tree from this sequence. The test set Ω_t is used to 
estimate the true performance of the chosen decision tree. The methodological 
developments that we introduce in this paper concern the construction phase; more 
precisely, the splitting of a categorical attribute. The splitting criterion we used 
was the Gini coefficient. The pruning phase is the same as in CART [Breiman & al., 
1984], although we do not use the test set to choose the final decision tree; instead, 
we split the learning set into two disjoint parts. 

Let us now suppose that we have a categorical (nominal) attribute v with L 
values and that we have K classes to be predicted. In the construction of a binary 
decision tree, each attribute must be binarized; that is, we must consider all of the 
binary attributes associated with a given predictive attribute and, at each node of the 
decision tree, choose the best binary attribute. The main problem with categorical 
attributes is that they give rise to an exponential number of binary attributes (2^(L-1) - 1), 
corresponding to all two-class partitions of the L values. If K=2, the authors of the 
celebrated method CART show how to find the best binary split generated by v in 
O(L.log(L)). Unfortunately, if the number of classes is greater than two, the 
complexity of the present version of CART is always exponential in L. However, it 
is possible to select the same binary attribute with a complexity O(2^(K-1).L.log(L)), 
which can be very interesting if K is smaller than L. This can be done by using the 
super-class procedure discussed in the CART book [Breiman & al., 1984], and then, for 
each division into two super-classes, choosing the best binary split in O(L.log(L)) with 
the Gini criterion. [Asseraf, 1996], by generalizing the Kolmogorov-Smirnov 
distance to the case of categorical variables, has shown how to choose the optimum 
split in O(L.K^). 
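The two-class shortcut mentioned above can be illustrated as follows. The sketch compares an exhaustive search over all 2^(L-1) - 1 two-class partitions with the ordering argument from the CART book: sort the L values by their proportion of one class, then examine only the L - 1 cuts along that ordering. Function names and the `table` layout (value mapped to a pair of class counts, each value observed at least once) are assumptions made for the illustration.

```python
from itertools import combinations


def gini(counts):
    """Gini impurity of a vector of class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def split_impurity(table, left):
    """Weighted Gini impurity after sending the values in `left` to one branch."""
    l, r = [0, 0], [0, 0]
    for value, (n0, n1) in table.items():
        side = l if value in left else r
        side[0] += n0
        side[1] += n1
    n = sum(l) + sum(r)
    return (sum(l) * gini(l) + sum(r) * gini(r)) / n


def best_split_exhaustive(table):
    """Try all 2**(L-1) - 1 partitions; the first value is fixed on the left."""
    values = sorted(table)
    first, rest = values[0], values[1:]
    best = None
    for k in range(len(rest)):  # leave at least one value on the right
        for extra in combinations(rest, k):
            left = {first, *extra}
            score = split_impurity(table, left)
            if best is None or score < best[0]:
                best = (score, left)
    return best


def best_split_ordered(table):
    """Two-class case: sort by class-1 proportion and try the L - 1 cuts."""
    order = sorted(table, key=lambda v: table[v][1] / sum(table[v]))
    best = None
    for cut in range(1, len(order)):
        left = set(order[:cut])
        score = split_impurity(table, left)
        if best is None or score < best[0]:
            best = (score, left)
    return best


if __name__ == "__main__":
    example = {"a": (30, 5), "b": (4, 20), "c": (10, 10), "d": (2, 25)}
    print(best_split_exhaustive(example))
    print(best_split_ordered(example))
```

The dominant cost of the ordered version is the sort, which is where the O(L.log(L)) figure comes from; the sketch recomputes the impurity from scratch at each cut for readability rather than speed.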

In our application, the initial predictive attributes were nominal (categorical), 
each with 20 values, and there were three classes to be predicted (K=3). In order to 
clarify the exposition, we will now briefly describe the application. 

2. Protein Secondary Structure Prediction Problem 

The primary structure of a protein is formally a word taken from an alphabet of 20 
letters (corresponding to the 20 amino acids A,T,P,...,L,M), whose length can 
reach many hundreds. The secondary structure of a protein can also be formally 
defined as a word of the same length, but the corresponding alphabet contains three 
letters, E, H and X, which are the names of the three concepts to be discriminated; H 
corresponds to an α-helix, E to a β-strand and X stands for loop (see figure 1). 

Figure 1 : Example of the primary and secondary structures of a protein 
... T T C C P S I V A R S ... 
    E E E E X X H H H H H 



Although the number of known primary sequences increases extremely fast, 
the same is not true for the secondary structure, and so there is a growing number of 
protein sequences whose secondary structure needs to be predicted. Let us now 
suppose that we have a primary sequence like the one in figure 1, and that we want 
to predict the class (E, H or X) corresponding to the letter (amino acid) S in the 
middle. The prediction depends not only on the letter S but also on 
its neighborhood, and so we have considered a window of size eleven (which we 
have found to be optimal), centered on the position to be predicted. To take account 
of the information present in that window, we started by defining 11 categorical 
attributes, v1, v2, ..., v11, each with 20 values, because there are 20 different amino 
acids. For example, if we did not know the secondary structure in figure 1 and we 
wanted to predict, for instance, the class (E, H or X) corresponding to the letter S, 
the attribute v1 would take the value T, and the attribute v6 the value S (see figure 2). 
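The window construction can be sketched as follows. The function name and the padding symbol used when the window runs past either end of the sequence are assumptions; the paper does not say how positions near the ends are handled.

```python
def window_attributes(sequence, position, width=11, pad="-"):
    """Return the `width` categorical attributes centred on `position`.

    Positions falling outside the sequence receive the (hypothetical) pad symbol.
    """
    half = width // 2
    attrs = []
    for offset in range(-half, half + 1):
        i = position + offset
        attrs.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return attrs


# Primary sequence from figure 1; the first S is the residue to be predicted.
primary = "TTCCPSIVARS"
attrs = window_attributes(primary, primary.index("S"))

if __name__ == "__main__":
    for number, value in enumerate(attrs, start=1):
        print(f"v{number} = {value}")
```

For the sequence of figure 1 this gives v1 = T and v6 = S, as stated in the text.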



Figure 2: The initial predictive attributes v1, v2, ..., v11 

Our data file consists of 151 non-homologous globular protein sequences 
[Colloc'h et al. 1993], whose sizes range from a dozen residues to almost nine 
hundred. The total number of residues, that is, of amino acids to be predicted, is 
about 30,000 (46.6% X, 29% H and 24.4% E). We have therefore a data file with 
about 30,000 objects described by 11+1 categorical variables, and we want to learn 
from it how to predict the last variable. 

3. The Results Obtained With CART and C4.5 

We started by considering the application of CART [Breiman & al., 1984] to 
these data. We must remember that at each node of the decision tree, and for each one 
of the eleven categorical attributes with 20 values, CART will test 2^19 - 1 binary 
attributes in order to choose the best one. This is computationally 
expensive; the results we obtained were 54.3% of correct predictions (using 63% 
of the data to learn and 37% to test). We have also considered the use of C4.5 
[Quinlan, 1993] and the results were 52.5% of correct predictions (10-fold 
cross-validation). Unlike CART, which always finds the best partition of the values into 
two classes (the best binary attribute), C4.5 has an option, -s, which also groups the 
values into classes, but not necessarily into two classes. Using this option, which is 
computationally expensive, the results of C4.5 were 47.3% of correct predictions 
(10-fold cross-validation). 

The attributes v1, v2, ..., v11, although natural to consider, do not take into 
account the order between the different amino acids inside the window; nevertheless, 
this order is most important for prediction purposes. To take this order into account, 
we decided to start by taking two adjacent amino acids at a time, for 
instance the pair (4,5), which, in figure 2, is (C, P). This corresponds to using the joint 
distribution of the variables v4 and v5. As there are 400 (20x20) different pairs of 
amino acids, we were obliged to consider new categorical attributes, each one with 
400 values. Using CART on these attributes is obviously prohibitive (2^399 - 1 binary 
attributes for each). Without using the option -s, we applied C4.5 to these new 
attributes, which took a very long time to run; the results were 52.6% of correct 
predictions (10-fold cross-validation). The next step consisted in using the amino 
acids three at a time, four at a time, and so on. Obviously, if we take s amino acids at 
a time, the corresponding attribute will have 20^s values. The best results were 
obtained for the case s = 4. In this case, our predictive attributes have 20^4 = 160,000 
values each, and therefore neither CART nor C4.5 is applicable. This was our 
motivation to develop a new method, ARCADE, which builds binary decision trees, 
and is capable of treating categorical attributes with a huge number of values. 

4. The ARCADE Method 

The ARCADE method [Pinto da Costa, 1996], [Lerman & Pinto da Costa, 1996], by 
grouping the values of each categorical (nominal) attribute v into clusters, allows the 
definition of another qualitative attribute with far fewer values. For example, let 
us consider an attribute w with 8 values m1, m2, ..., m8. By grouping these values 
into three clusters, for example {m1, m5, m3}, {m2, m8}, {m4, m6, m7}, we can define 
a new nominal attribute with three values. For this new attribute to be meaningful, 
the values in the same cluster have to be similar with respect to their 
statistical behavior on the classes to be predicted. To this end, a contingency 
table crossing the attribute w with the attribute c to be predicted is established. 
Applying the hierarchical classification method AVL [Lerman, 1993] to the 
rows of this contingency table allowed us to find a significant partition of the values 
into clusters. For example, let us consider the contingency table crossing our attribute 
v (160,000 values) with c (3 values). 



Table 1 : contingency table crossing v with c 

This is therefore the main idea of our method: clustering the values of a 
predictive categorical attribute in order to reduce the complexity. However, 
when the number of values is too large (160,000 in our case), the direct 
use of a clustering method on the row set of the contingency table becomes, 
for complexity and statistical reasons, neither tractable nor significant. On the other 
hand, these attributes with 160,000 values correspond to the case s = 4 above. Each 
one of these attributes v is therefore the “multiplication” of two attributes, v′ and 
v″, both with s = 2. Thus, we establish two contingency tables for these two latter 
attributes, each one with 400 rows. 

Table 2: contingency table crossing v′ with c 
Table 3: contingency table crossing v″ with c 

We apply AVL to the rows of both contingency tables, in order to get two significant 
attributes, both with far fewer values (for instance, 31 values each). 
We then “multiply” (that is, consider the joint distribution of) these two latter attributes, 
getting a new attribute with 31x31=961 values. The contingency table crossing 
this latter attribute with the attribute to be predicted is established, and AVL is 
applied to its 961 rows. A significant partition with, for instance, 12 classes is 
obtained, and the values of the final attribute correspond to the classes of this final 
partition. Our initial attribute, with 160,000 values, was therefore converted into an 
attribute with 12 values, by means of three hierarchical classifications. We could 
now consider constructing a decision tree on these final attributes. Nevertheless, the 
values of these final attributes are hierarchically structured, because we have applied 
a hierarchical clustering algorithm (see figure 3). We take advantage of this and, instead of 
considering all of the binary attributes, we consider only about twenty, 
corresponding to the leaves and nodes of the hierarchical tree (see figure 4). 



Figure 3 : Hierarchical clustering tree on the final 12 macro-values 




Figure 4 : The binary attributes used by ARCADE 

a1 : (a1 = 1 in {e1}) and (a1 = 0 in {e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12}) (leaf) 
a2 : (a2 = 1 in {e2}) and (a2 = 0 in {e1, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12}) (leaf) 
... 
a12 : (a12 = 1 in {e12}) and (a12 = 0 in {e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11}) (leaf) 
a13 : (a13 = 1 in {e4, e7}) and (a13 = 0 in {e1, e2, e3, e5, e6, e8, e9, e10, e11, e12}) (node 1) 
a14 : (a14 = 1 in {e1, e3}) and (a14 = 0 in {e2, e4, e5, e6, e7, e8, e9, e10, e11, e12}) (node 2) 
... 
a20 : (a20 = 1 in {e6, e8, e12}) and (a20 = 0 in {e1, e2, e3, e4, e5, e7, e9, e10, e11}) (node 8) 
a21 : (a21 = 1 in {e4, e7, e5, e10, e9, e2}) and (a21 = 0 in {e11, e1, e3, e6, e8, e12}) (node 9) 

Thus the binary attributes that ARCADE chooses are a1, a2, ..., a21 (see figure 4). 
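The way a hierarchy over the values yields a short list of binary attributes can be sketched as follows. This is not the AVL method: as a loudly labelled stand-in, the sketch merges, at each step, the two clusters whose class profiles (row proportions of the contingency table) are closest in squared Euclidean distance. All function names are hypothetical. What the sketch does reproduce is the counting: one binary attribute per leaf and per internal node, excluding the root and counting the two complementary children of the root only once.

```python
def row_profile(counts):
    """Conditional class distribution of one row of the contingency table."""
    total = sum(counts)
    return [x / total for x in counts]


def cluster_values(table):
    """Agglomerative clustering of the rows of `table` (value -> class counts).

    Stand-in for AVL: repeatedly merges the two clusters with the closest
    class profiles. Returns the internal nodes in order of creation.
    """
    clusters = {frozenset([v]): list(c) for v, c in table.items()}
    nodes = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                pa = row_profile(clusters[keys[i]])
                pb = row_profile(clusters[keys[j]])
                d = sum((x - y) ** 2 for x, y in zip(pa, pb))
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        _, a, b = best
        merged_counts = [x + y for x, y in zip(clusters.pop(a), clusters.pop(b))]
        clusters[a | b] = merged_counts
        nodes.append(a | b)
    return nodes


def binary_attributes(table):
    """One binary attribute (set of values coded 1) per leaf and per useful node."""
    values = frozenset(table)
    candidates = [frozenset([v]) for v in table] + cluster_values(table)
    seen, result = set(), []
    for cluster in candidates:
        complement = values - cluster
        # Skip the root (empty complement) and splits already represented.
        if not complement or cluster in seen or complement in seen:
            continue
        seen.add(cluster)
        result.append(cluster)
    return result


if __name__ == "__main__":
    toy = {f"e{i}": (3 * i + 1, 40 - 2 * i, 5 + (i * i) % 7) for i in range(1, 13)}
    attrs = binary_attributes(toy)
    print(len(attrs), "binary attributes instead of", 2 ** 11 - 1)
```

For a binary hierarchy over L values this gives 2L - 3 attributes, that is 21 when L = 12, which matches the attributes a1 to a21 listed in figure 4.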

The reduction in complexity is therefore enormous: 2^159,999 - 1 -> 2^11 - 1 -> 
about 20. This last part of our method is similar to the Bit-Per-Category 
encoding scheme (see [Almuallim & al., 1995] for a description of this and other 
methods for dealing with tree-structured attributes). In our method, by 
contrast, the main contribution is in the construction of a tree-structured attribute 
from a nominal one, whereas in the work of [Almuallim & al., 1995], and other 
methods, the initial predictive attributes are already tree-structured. 

5. Results and Conclusions 

In this work we have set up a method, ARCADE, for extracting a small number of 
predictive binary attributes from a categorical one. This method is especially useful in 
the case of attributes with a large number of values. Nevertheless, even in the case 
of an attribute with a small number of values, the application of ARCADE can be 
interesting because it defines a tree structure on these values. ARCADE is presented 
here in the context of the construction of a binary decision tree. However, 
our method is independent of the prediction method used, and it can therefore be 
employed by any other prediction method based on binary attributes. The application 
of ARCADE to the protein secondary structure prediction problem resulted in 65.1% 
of correct predictions, which is amongst the best results. The highest performance, 
on a data set different from ours, was obtained by the method PHD [Rost & 
Sander, 1993], which is a combination of several neural networks. This method uses 
additional information, which we do not use, given by the multiple sequence 
alignments of homologous proteins. If a new protein has no homologous proteins, the 
performance of PHD falls considerably. 

Abstract: In this paper a brief review of the main contributions to projection 
pursuit discriminant analysis is presented. A new procedure, based on distance 
measures between probability densities, is developed in order to obtain linear 
discriminant functions which best separate different populations. In the two- 
group case, Matusita’s distance between the projected population density 
functions is adopted as projection index, while in the multi-group case a 
monotone transformation of Matusita’s affinity coefficient is employed to 
measure the separation among the marginal probability density functions of the 
different populations. Simulation studies stress the efficacy of the proposed 
method in comparison with classical parametric ones and with the projection 
pursuit based linear discriminant procedure developed by Posse. 

Key words: Matusita’s Affinity Coefficient, Projection Pursuit, Discriminant 
Analysis, Matusita’s Distance Coefficient. 



1. Introduction 

Projection pursuit (Friedman and Tukey, 1974) is a multivariate statistical 
method whose goal is the search for the most interesting low-dimensional linear 
projections of a multidimensional data set, in order to detect the non-linear 
structure underlying the data. 

The most important feature of projection pursuit is its ability to overcome the 
“curse of dimensionality” caused by the fact that high-dimensional space is 
mostly empty. Projection pursuit avoids this problem by working on low- 
dimensional linear projections. Furthermore, projection pursuit is able to detect 
and ignore irrelevant variables (Huber, 1985). 

In this paper attention is focused on projection pursuit discriminant analysis. 
Different solutions to discrimination and classification problems based on 
projection pursuit methods have recently appeared in the statistical literature. 
The different proposals can be grouped into three main approaches. The first one 
consists in applying projection pursuit regression methods in order to model 
either the posterior probability distributions of group membership or the 
likelihood ratio (Henry, 1983; Friedman, 1985; Flick et al., 1990; Roosen and 
Hastie, 1993). The second one is based on projection pursuit density estimation 
methods in order to approximate the probability density functions underlying 
the data in each population (Polzehl, 1995; Calo and Montanari, 1997). The 
third approach consists in constructing linear discriminant functions by 
optimising some suitably chosen projection index. Posse (1992) determines 
linear discriminant functions by minimising the total probability of 
misclassification, while Chen and Muirhead (1994) construct robust linear 
discriminant functions by inserting different robust location and scale estimators 
into Fisher’s ratio of between-class to within-class variation. 

The solution examined in this paper belongs to this last approach: we select the 
linear discriminant functions, which maximise an index measuring the 
separation between the groups. Separation is defined in terms of distance 
coefficients between the projected population density functions. 



2. Projection pursuit discriminant analysis based on Matusita’s 
distance coefficient 

In the context of projection pursuit discriminant analysis, the most interesting 
linear projections of the multidimensional data set are those which permit to 
obtain the best separation among the groups. 

In this paper we propose to measure the distance between two populations by 
Matusita’s (1956) coefficient, also referred to as Hellinger's distance. Let X be a 
p-dimensional random vector with probability density functions $f_1(x)$ and 
$f_2(x)$ in two distinct populations $\pi_1$ and $\pi_2$, respectively. Matusita’s distance 
between the two multivariate populations $\pi_1$ and $\pi_2$, $D(\pi_1, \pi_2)$, is defined as 

    $D(\pi_1, \pi_2) = \left\{ \int [\sqrt{f_1(x)} - \sqrt{f_2(x)}]^2 dx \right\}^{1/2} = \sqrt{2(1 - \rho)}$    (1) 

where $\rho = \int \sqrt{f_1(x) f_2(x)}\, dx$ is the affinity coefficient, introduced by 
Bhattacharyya (1943), measuring the similarity between the two probability 
density functions $f_1(x)$ and $f_2(x)$ and taking values in the [0,1] interval. 
Matusita’s distance is a non-negative measure; it equals zero if and only if 
$f_1 = f_2$; it is symmetric and it satisfies the triangle inequality. 

Matusita’s distance between two multivariate density functions has been 
exploited so far only in the context of parametric discriminant analysis (see 
Krzanowski 1988, for references). 

In order to develop a projection pursuit discriminant analysis algorithm based 
on Matusita’s distance, let $Y = \theta' X$ be the linear projection onto the 
unidimensional space spanned by the unit p-vector $\theta$ and let $f_{i\theta}(y)$ be the 
marginal probability density function of X in the direction $\theta$ if X comes from 
population $\pi_i$ (i = 1, 2). 

The linear discriminant function which shows the best separation between $\pi_1$ 
and $\pi_2$, in terms of Matusita’s distance, can thus be defined as the direction $\theta$ 
maximising: 

    $D_\theta(\pi_1, \pi_2) = \left\{ \int [\sqrt{f_{1\theta}(y)} - \sqrt{f_{2\theta}(y)}]^2 dy \right\}^{1/2} = \sqrt{2(1 - \rho_\theta)}$    (2) 

where $\rho_\theta = \int \sqrt{f_{1\theta}(y) f_{2\theta}(y)}\, dy$. 
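As a numerical illustration of the relation between the distance and the affinity coefficient, the sketch below evaluates both integrals on a grid for two univariate normal densities with equal variance. For that case the affinity has the well-known closed form exp(-(μ1 - μ2)² / (8σ²)). The densities, grid and function names are choices made for the example only.

```python
import numpy as np


def normal_pdf(y, mean, sd):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def affinity(f1, f2, dy):
    """Affinity coefficient: integral of sqrt(f1 * f2), by a Riemann sum."""
    return np.sum(np.sqrt(f1 * f2)) * dy


def matusita(f1, f2, dy):
    """Matusita distance: square root of the integral of (sqrt(f1) - sqrt(f2))^2."""
    return np.sqrt(np.sum((np.sqrt(f1) - np.sqrt(f2)) ** 2) * dy)


y = np.linspace(-12.0, 14.0, 20001)
dy = y[1] - y[0]
f1 = normal_pdf(y, 0.0, 1.0)
f2 = normal_pdf(y, 2.0, 1.0)
rho = affinity(f1, f2, dy)
dist = matusita(f1, f2, dy)

if __name__ == "__main__":
    print("affinity:", rho, "closed form:", np.exp(-0.5))
    print("distance:", dist, "sqrt(2(1 - rho)):", np.sqrt(2.0 * (1.0 - rho)))
```

The two ways of computing the distance agree because both densities integrate to one, so that expanding the square leaves 2 - 2ρ.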

The sample version of this projection index is obtained by replacing the 
unknown densities in (2) with their kernel estimates: 

    $\hat{D}_\theta(\pi_1, \pi_2) = \left\{ \int [\sqrt{\hat{f}_{1\theta}(y)} - \sqrt{\hat{f}_{2\theta}(y)}]^2 dy \right\}^{1/2}$    (3) 

Let $x_{i1}, \dots, x_{in_i}$ be a sample from the population $\pi_i$ (i = 1, 2). The univariate 
kernel estimate (Silverman, 1986) of the marginal density function of the i-th 
group is defined by: 

    $\hat{f}_{i\theta}(y) = \frac{1}{n_i h_i} \sum_{j=1}^{n_i} K((y - \theta' x_{ij}) / h_i)$    (4) 

where $K(\cdot)$ is a kernel function, usually a symmetric probability density function, 
and $h_i$ is the window width (one for each population), which controls the 
degree of smoothness of the estimate. 

Once the unit vector $\theta_0$ maximising the index (3) has been found, the 
allocation rule proposed in order to classify a new observation x to one of the 
possible populations consists in computing the linear combination $y = \theta_0' x$ and 
assigning x to $\pi_1$ if $\hat{f}_{1\theta_0}(y) > \hat{f}_{2\theta_0}(y)$, to $\pi_2$ otherwise. 
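The sample index (3), the kernel estimate (4) and the allocation rule can be sketched as follows. Two substitutions are made for simplicity and should not be read as the authors' choices: the window width is set by the normal-reference rule of thumb h = 1.06 s n^(-1/5) instead of an automatic selection method, and the optimisation is a plain search over randomly drawn directions. All function names are hypothetical.

```python
import numpy as np


def kde(y_grid, proj, h):
    """Gaussian kernel estimate (4) of the projected density, evaluated on y_grid."""
    u = (y_grid[:, None] - proj[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(proj) * h * np.sqrt(2.0 * np.pi))


def rule_of_thumb_bandwidth(proj):
    """Normal-reference window width (a stand-in for an automatic selector)."""
    return 1.06 * proj.std(ddof=1) * len(proj) ** (-0.2)


def projection_index(theta, X1, X2, n_grid=512):
    """Sample Matusita distance (3) between the two groups projected on theta."""
    theta = theta / np.linalg.norm(theta)
    z1, z2 = X1 @ theta, X2 @ theta
    h1, h2 = rule_of_thumb_bandwidth(z1), rule_of_thumb_bandwidth(z2)
    lo = min(z1.min() - 4.0 * h1, z2.min() - 4.0 * h2)
    hi = max(z1.max() + 4.0 * h1, z2.max() + 4.0 * h2)
    y = np.linspace(lo, hi, n_grid)
    f1, f2 = kde(y, z1, h1), kde(y, z2, h2)
    return np.sqrt(np.sum((np.sqrt(f1) - np.sqrt(f2)) ** 2) * (y[1] - y[0]))


def random_search(X1, X2, n_iter=200, seed=0):
    """Keep the best of n_iter random unit directions (a crude optimiser)."""
    rng = np.random.default_rng(seed)
    best_theta, best_val = None, -np.inf
    for _ in range(n_iter):
        theta = rng.normal(size=X1.shape[1])
        val = projection_index(theta, X1, X2)
        if val > best_val:
            best_theta, best_val = theta / np.linalg.norm(theta), val
    return best_theta, best_val


def allocate(x, theta, X1, X2):
    """Assign x to population 1 or 2 by comparing the projected kernel estimates."""
    z1, z2 = X1 @ theta, X2 @ theta
    y = np.array([x @ theta])
    f1 = kde(y, z1, rule_of_thumb_bandwidth(z1))[0]
    f2 = kde(y, z2, rule_of_thumb_bandwidth(z2))[0]
    return 1 if f1 > f2 else 2
```

A typical use would call `random_search` on the two training samples and then `allocate` for each test observation with the direction found; comparing the two estimated densities directly corresponds to the equal-priors rule stated in the text.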

Let $f_1(x), \dots, f_G(x)$ be the probability density functions of the p-dimensional 
random vector X in G distinct populations $\pi_1, \dots, \pi_G$ (G > 2). Matusita (1967) 
developed an extension of the affinity coefficient $\rho$, measuring the similarity 
between $f_1(x), \dots, f_G(x)$, defined as: 

    $\rho_G = \int [f_1(x) \cdots f_G(x)]^{1/G} dx$    (5) 

Glick (1973) proposed the following transformation of $\rho_G$ as a distance measure: 

    $D_G(\pi_1, \dots, \pi_G) = 1 - \rho_G^{G/2}$    (6) 

It is symmetric, invariant under permutation of its arguments, non-negative and 
it equals zero if and only if $f_1 = \dots = f_G$. The measure of total separation among 
the G populations is not less than the separation between any two of them. 

The projection index based on this distance measure is defined as: 

    $D_{G\theta}(\pi_1, \dots, \pi_G) = 1 - \left\{ \int [f_{1\theta}(y) \cdots f_{G\theta}(y)]^{1/G} dy \right\}^{G/2}$    (7) 

The empirical version of (7) is obtained by estimating the unknown univariate 
density functions by means of kernel density estimators (4), on the basis of G 
samples $x_{i1}, \dots, x_{in_i}$ (i = 1, ..., G). 

Let $\theta_0$ be the unit vector which maximises this projection index. The allocation 
rule consists in assigning a new observation x to the i-th population if 
$\hat{f}_i(\theta_0' x) > \hat{f}_j(\theta_0' x)$ for each $j \neq i$ (i, j = 1, ..., G). 



3. Examples and conclusions 



In order to test the performances of our projection pursuit discriminant analysis 
algorithm we have considered three situations described in Polzehl (1995). 



S1: $\pi_1$: … $+$ … $N_2((-\sqrt{3}, 1)', I_2)$ $+$ … $N_2((-\sqrt{3}, -1)', I_2)$; $\pi_2$: $LogN_2((0, 0)', 0.5\, I_2)$



Situation S3 is obtained by adding two noisy variables to situation SI. It is used 
to study the behaviour of the procedure with an increasing number of 
dimensions. For each situation we have generated 100 training samples of size 
150 and test samples of size 900 for each population. In our applications we 
have used a standard gaussian kernel whose window width has been chosen by 
Sheather and Jones’s (1991) automatic selection method. Numerical 
optimisation has been carried out through the random search algorithm 
developed by Huber (1990). We have computed the mean percent error rates 
(percent error rates averaged over the 100 replications) both on the training 
samples and on the test samples and we have compared the results of our 
method (PPDA) with those obtained by linear discriminant analysis (LDA), 
quadratic discriminant analysis (QDA) and by Posse’s algorithm (POSSE). 

Table 1: Mean percent error rates computed on the test (training) samples

        LDA           QDA           POSSE         PPDA
S1      50.1 (48.8)   49.7 (46.6)   38.9 (31.3)   37.3 (33.9)
S2      50.9 (48.0)   48.7 (45.9)   37.0 (31.1)   36.4 (32.9)
S3      50.0 (46.6)   50.3 (43.1)   39.9 (30.6)   38.6 (33.8)


Our totally non-parametric procedure performs far better than the classical
parametric ones in all of the situations considered (see Table 1) and seems to be
a valid competitor to the projection pursuit based procedure developed by Posse
(1992). Aimed at the minimisation of the total probability of misclassification,
Posse’s algorithm is the best on the training samples, but, when evaluated on the
test samples, it is outperformed by our procedure. In the case of normal 
populations with equal covariance matrices, the projection directions, optimising 
both indices, approximate the coefficient vector defining Fisher linear 
discriminant function. Otherwise, the different behaviour of the two indices is 
due to the different definition of separation: our index measures the distance 
between the projected probability density functions and takes into account their 
different shapes over the whole domain, while Posse’s index depends only on the 
overlapping area between them. 

It is worthwhile to note the ability of projection pursuit to ignore irrelevant
variables. In all the replications of situation S3, the solution vector is about
$\theta_0 = (0.866, -0.494, -0.009, -0.083)$. If we scale the absolute values of
these coefficients with the standard deviations of the corresponding variables
and normalise the resulting vector, we obtain the importance coefficient vector
(Montanari and Lizzani, 1996) $c = (0.866, 0.497, 0.005, 0.047)$. By comparing
each component of this vector, which measures the discrimination power of each
variable, with the 10-th percentile of the uniform distribution on the unit
$p$-sphere (0.078 in this case), we can deem irrelevant the contributions of the
last two variables, which represent noisy terms. The 10-th percentile of the
uniform distribution on the unit $p$-sphere is chosen as the cut-off value
because, when $f_1 = f_2$, the directions on the unit $p$-sphere are equally
uninteresting and have the same probability to be chosen as solution vectors by
the projection pursuit algorithm.
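
The cut-off value can be checked numerically. The sketch below assumes that
"percentile of the uniform distribution on the unit p-sphere" refers to the
absolute value of a single coordinate of a uniformly distributed unit vector,
whose density is proportional to $(1 - u^2)^{(p-3)/2}$; under that assumption it
returns a value close to the 0.078 quoted for $p = 4$. The function names are
hypothetical.

```python
def sphere_cutoff(p, prob=0.10, steps=5000):
    """Percentile of |u_1| for u uniform on the unit sphere in R^p (p >= 3).

    The density of one coordinate is proportional to (1 - u^2)^((p - 3) / 2);
    the percentile is found by bisection on a midpoint-rule integral.
    """
    expo = (p - 3) / 2.0

    def integral(upper):
        h = upper / steps
        return h * sum((1.0 - ((k + 0.5) * h) ** 2) ** expo for k in range(steps))

    total = integral(1.0)
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if integral(mid) / total < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def irrelevant_variables(importance, prob=0.10):
    """Flag the variables whose importance coefficient falls below the cut-off."""
    cutoff = sphere_cutoff(len(importance), prob)
    return [value < cutoff for value in importance]
```

Applied to the importance vector reported above, `irrelevant_variables` flags
only the last two components.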



References 

Bhattacharyya, A. (1943). On a measure of divergence between two statistical 
populations defined by their probability distributions. Bulletin of the Calcutta 
Mathematical Society, 35, 99-109. 

Calo, D. & Montanari, A. (1997). An empirical discrimination algorithm based
on projection pursuit density estimation, Book of Short Papers, Classification
and Data Analysis, Meeting of the Classification Group of Societa Italiana di
Statistica, Pescara, 3-4 July, 65-68.

Chen, Z. & Muirhead, R.J. (1994). A comparison of robust linear discriminant 
procedures using projection pursuit methods, in Multivariate Analysis and its 
Applications, IMS Lecture Notes-Monograph Series, Hayward, 24, 163-176. 

Flick, T.E. & Jones, L.K. & Priest, R.G. & Herman, N.C. (1990). Pattern 
classification using projection pursuit. Pattern Recognition, 12, 1367-1376. 

Friedman, J.H. (1985). Classification and multiple regression through projection 
pursuit. Technical Report 12, Laboratory for Computational Statistics, 
Department of Statistics, Stanford University. 

Friedman, J.H. & Tukey, J. (1974). A projection pursuit algorithm for 
exploratory data analysis, IEEE Transactions on Computers, 23, 881-889. 

Glick, N. (1973). Separation and probability of correct classification among two 
or more distributions. Annals of the Institute of Statistical Mathematics, 25, 
373-382. 

Henry, D.H. (1983). Multiplicative models in projection pursuit, Ph. D. Thesis, 
Stanford Linear Accelerator Center, Stanford University. 

Huber, P.J. (1985). Projection pursuit. The Annals of Statistics, 13, 435-475. 

Huber, P.J. (1990). Algorithms for projection pursuit. Technical Report 3, 
Department of Mathematics, M.I.T. Cambridge. 

Krzanowski, W.J. (1988). Principles of multivariate analysis, Oxford Science 
Publication, New York, 359-360. 

Matusita, K. (1956). Decision rule, based on distance, for the classification
problem. Annals of the Institute of Statistical Mathematics, 8, 67-77.

Matusita, K. (1967). On the notion of affinity of several distributions and some
of its applications. Annals of the Institute of Statistical Mathematics, 19,
181-192.

Montanari, A. & Lizzani, L. (1996). Projection pursuit e scelta delle variabili, 
Atti della XXXVIII Riunione Scientifica della Societa Italiana di Statistica, 
2, 591-598. 

Polzehl, J. (1995). Projection pursuit discriminant analysis, Computational 
Statistics and Data Analysis, 20, 141-157. 

Posse, C. (1992). Projection pursuit discriminant analysis for two groups. 
Communications in Statistics - Theory and Methods, 21, 1-19. 

Roosen, C.B. & Hastie, T.J. (1993). Logistic response projection pursuit 
regression. Statistics and Data Analysis Research Department, AT&T Bell 
Laboratories, Doc. BL011214-930806-09TM. 

Sheather, S.J. & Jones, M.C. (1991). A reliable data-based bandwidth selection
method for kernel density estimation. Journal of the Royal Statistical Society,
Series B, 53, 683-690.

Silverman, B.W. (1986). Density estimation for statistics and data analysis. 
Chapman and Hall, London. 




Two Group Linear Discrimination Based on 
Transvariation Measures 

Angela Montanari Daniela G. Calo 

Dipartimento di Scienze Statistiche, Universita di Bologna, 

Via Belle Arti, 41, 40126 Bologna, Italy - e mail: montanar@stat.unibo.it 

Abstract: In this paper we derive a two group linear discriminant function
… case of projection pursuit methods and improves the performance of Fisher’s
LDF when the conditions which guarantee its optimality do not hold.

1. Introduction

In classification issues, the two group linear discriminant function (LDF) is 
commonly used mainly because of its simple form and interpretability. 

Its origin goes back to Fisher (1936), who suggested searching directly for the
linear combination of the p measured characteristics which maximizes group
separation, defined as the ratio of “between” to “within” group variance under
the homoscedasticity constraint. It apparently does not require any
distributional assumption, but normality or at least symmetry is actually
implicitly assumed.
Fig. 1 clearly shows this point. For skewed distributions it may in fact happen 
that Fisher’s criterion does not succeed in distinguishing between situation a) 
and b): they have the same “between” to “within” group variance ratio, while 
from a discrimination perspective case a) is clearly preferable. 

Furthermore, under normality, and of course homoscedasticity, the allocation 
rule based on Fisher’s LDF is also the one which minimizes the total probability 
of misclassification (or maximizes posterior probability of group membership). 

Figure 1: Two situations having the same between to within group variance
ratio but different overlapping.





The performance of Fisher’s LDF when the homoscedasticity and normality
assumptions are violated has been widely studied both theoretically and
empirically (see Krzanowski and Marriott, 1995, for a detailed review). All the
studies have shown that as covariance differences increase, the behavior of
Fisher’s function, i.e. its error rate, for moderate sample sizes deteriorates
(for small samples Fisher’s function seems to perform better than the quadratic
allocation rule as it involves a smaller number of parameters, but its error rate
may still be too high for practical use). The same happens for departures from
normality, whose effect varies according to the distribution shape.

To cope with heteroscedasticity in the normal case, while preserving the simple 
form of the linear function, Anderson and Bahadur (1962) have proposed an 
alternative they have called the “best” linear discriminant function, meaning 
that it’s the linear function which minimizes the total probability of 
misclassification. 

This idea has recently been generalized by Posse (1992), in a fully
non-parametric context, exploiting the great potential of projection pursuit
methods (Huber, 1985). In fact projection pursuit is concerned with
“interesting” low dimensional projections of high dimensional data and if
“interestingness” is equated to “showing the minimum total probability of
misclassification” for any group density then Posse’s method is obtained.

It obviously gives the best error rates as far as the classification of the units 
belonging to the training sample is concerned, but for small or moderate sample 
sizes nothing guarantees its good performance on new cases whose group 
membership is to be determined; in other words, it may be derailed by a sort of 
overfitting effect. 

As an alternative distribution free solution we propose an LDF whose derivation
is still based on the projection pursuit methodology but whose logic is closer to
Fisher’s than to Posse’s. Our linear discriminant function is obtained by
maximizing group separation in terms of Gini’s transvariation.



2. Transvariation and its measures 

According to Gini (1916), two groups $g_1$ and $g_2$ are said to transvariate on
a variable $y$, with respect to an average value $m_y$, if the sign of some of
the $n_1 n_2$ differences $y_{1i} - y_{2j}$ ($i = 1, 2, \dots, n_1$;
$j = 1, 2, \dots, n_2$) which can be defined between the $y$ values belonging to
the two groups is opposite to that of $m_{y_1} - m_{y_2}$. Any difference
satisfying this condition is called “a transvariation” and $|y_{1i} - y_{2j}|$
is its intensity. Similarly the group $g_k$ ($k = 1, 2$) and a constant $c$ are
said to transvariate on $y$, with respect to an average value $m_y$, if the sign
of some of the $n_k$ ($k = 1, 2$) differences $y_{ki} - c$
($i = 1, 2, \dots, n_k$) is opposite to that of $m_{y_k} - c$.

In order to measure the transvariation between a group and a constant or
between two groups Gini first introduced, among others, transvariation
probability, transvariation intensity and transvariation area.

The discrete version of the transvariation probability with respect to an average
value - which Gini assumed to be the median - is defined as the ratio of the
number of transvariations ($s_k$ for transvariation with a constant and $s_{12}$
for transvariation between two groups) to its maximum ($n_k/2$ or $n_1 n_2/2$
respectively when the group character distribution is symmetric, or a higher
value, whose maximum is 3 2/7^ rij , for skewed distributions; see Gini for a 
thorough discussion ). 
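
As an illustration, the discrete transvariation probability between two groups
can be computed by direct counting. The sketch below is a minimal version under
stated assumptions: the median is the average value, the maximum is taken as
$n_1 n_2 / 2$ (the symmetric case), null differences are not counted, and equal
medians are treated as a positive reference sign. The function names are
hypothetical.

```python
from statistics import median

def transvariation_count(y1, y2):
    """Number of pairs whose difference y1[i] - y2[j] has the sign opposite
    to that of median(y1) - median(y2)."""
    # Equal medians are treated as a positive reference sign (assumption).
    reference = 1 if median(y1) >= median(y2) else -1
    return sum(1 for a in y1 for b in y2 if (a - b) * reference < 0)

def transvariation_probability(y1, y2):
    """Ratio of the number of transvariations to n1 * n2 / 2."""
    return transvariation_count(y1, y2) / (len(y1) * len(y2) / 2.0)
```

For example, the groups `[1, 2, 3, 4]` and `[3.5, 5, 6, 7]` share a single
transvariation (the pair 4 and 3.5), which gives a probability of 1/8.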

Denoting by $f_k(y)$ ($k = 1, 2$) the probability density function of $y$ in the
parent population of group $g_k$ and by $F_k(y)$ its distribution function, and
assuming that $m_{y_1} > m_{y_2}$, the continuous expression of the
transvariation probability is given by
$2 \int_{-\infty}^{+\infty} F_1(y) f_2(y)\, dy$ and the transvariation area
(which only admits a continuous formulation) is:

$$\int_{-\infty}^{+\infty} \tau(y)\, dy \quad \text{where} \quad \tau(y) = \min\left(f_1(y), f_2(y)\right) \qquad (1)$$



The transvariation probability equals zero when the sign of the difference
between the two medians allows one to predict with certainty the sign of the
difference between the values of $y$ in any two units belonging to the two
different groups; it increases as the probability of such a forecast decreases,
and reaches its maximum value, i.e. 1, when the relative positions of the medians
of the two groups do not allow any prediction. Its complement can thus be
assumed as a measure of the typicality of the medians which summarizes
location, scale, and shape characteristics.

When the transvariation probability is zero, the two groups do not overlap and
therefore the transvariation area is also zero, but the converse is not always
true (Gini, 1959, p. 257). In fact for multimodal densities it may happen that
the transvariation area is zero while the transvariation probability is not.
This means that the two measures usually highlight different aspects of group
transvariation. However, when $y$ is normally distributed in the parent
population of both $g_1$ and $g_2$ and homoscedasticity holds ($\sigma_y$
denotes the common standard deviation), the transvariation probability and the
transvariation area are monotonically linked, being

$$\text{transvariation probability} = 1 - \Theta\left(\frac{m_{y_1} - m_{y_2}}{2\sigma_y}\right) \quad \text{and} \quad \text{transvariation area} = 1 - \Theta\left(\frac{m_{y_1} - m_{y_2}}{2\sqrt{2}\,\sigma_y}\right) \qquad (2)$$

(where $\Theta$ is the Laplace integral
$\Theta(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-z^2)\, dz$): they therefore give
equivalent information.
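
The monotone link in the normal homoscedastic case can be verified numerically.
The sketch below integrates the two continuous definitions with a midpoint rule
and compares them with closed forms written in terms of Python's `math.erf`;
these closed forms are our own derivation from the normal distribution function
and are stated as an assumption. All names are hypothetical.

```python
import math

def norm_pdf(y, m, s):
    return math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def norm_cdf(y, m, s):
    return 0.5 * (1.0 + math.erf((y - m) / (s * math.sqrt(2.0))))

def numeric_measures(m1, m2, s, steps=20000):
    """Midpoint-rule values of the transvariation probability and area
    for two normal populations with common standard deviation s."""
    hi, lo = max(m1, m2), min(m1, m2)
    a, b = lo - 10.0 * s, hi + 10.0 * s
    h = (b - a) / steps
    prob = area = 0.0
    for k in range(steps):
        y = a + (k + 0.5) * h
        # 2 * integral of F(y) f(y), F belonging to the group with larger mean
        prob += 2.0 * norm_cdf(y, hi, s) * norm_pdf(y, lo, s) * h
        # integral of the pointwise minimum of the two densities
        area += min(norm_pdf(y, hi, s), norm_pdf(y, lo, s)) * h
    return prob, area

def closed_forms(m1, m2, s):
    """Closed forms derived under normality and homoscedasticity (assumption)."""
    d = abs(m1 - m2)
    probability = 1.0 - math.erf(d / (2.0 * s))
    area = 1.0 - math.erf(d / (2.0 * math.sqrt(2.0) * s))
    return probability, area
```

Both measures decrease as the standardised distance between the means grows,
which is why minimising either one leads to the same direction in this case.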

While the previously described indices involve frequencies only, the
transvariation intensity regards the character values too. It is defined with
respect to the arithmetic mean and its discrete expression is the ratio of the
sum of the transvariation intensities to its maximum. It too admits a continuous
expression, which makes it possible to show the equivalence of transvariation
intensity to transvariation probability and area in the normal homoscedastic
case.



3. Transvariation based discrimination and classification 

From the above discussion it emerges that transvariation measures can be 
profitably used to discriminate between two groups (having equal prior 
probabilities). 

Let $x$ be the $p$-dimensional random vector of the variables to be used in the
discrimination process, $a$ a $p$-dimensional unit vector defining a projection
direction and $y = a'x$ the random variable that results from projecting $x$
along $a$. An LDF can then be derived, in a projection pursuit framework, as the
linear combination ($y = a'x$) which minimizes transvariation probability,
intensity or area. This possibility has already been indicated by Salvemini
(1959) but, as far as we know, it has not yet been translated into an operative
procedure. Projection pursuit offers a setup within which this solution may be
pursued.

A closer look at equation (1) shows that (in the equal prior case) the
transvariation area is but twice the total probability of misclassification,
i.e. the quantity Posse (1992) has suggested to minimize. In the multivariate
normal homoscedastic case Posse has shown that his projection pursuit LDF
coincides with Fisher’s, but as transvariation probability, transvariation
intensity and transvariation area (i.e. twice the total probability of
misclassification) are monotonically linked (see (2)), minimization of the
transvariation probability or the transvariation intensity (with respect to $a$)
yields Fisher’s function too. This is clearly represented in Fig. 2, where all
the projections of two bivariate normal homoscedastic populations have been
examined. Each discrimination measure has been evaluated as a function of the
rotation angle, which ranged, at steps of one degree, from 0 to 180 degrees.
Fisher’s criterion attains its maximum exactly where all the transvariation
measures are minimized.

Figure 2: Values of Fisher’s criterion (dotted line), transvariation intensity
(crosses), probability (solid line) and area (bold line) as a function of the
rotation angle for two bivariate normal homoscedastic populations.
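
The scan over rotation angles described above can be reproduced on simulated
data. The following sketch is illustrative only: the sample sizes and means in
the usage note are our own choices, the discrete transvariation probability uses
the median and the symmetric-case maximum $n_1 n_2 / 2$, and the function names
are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right
from statistics import median

def project(points, angle_deg):
    """Project bivariate points on the direction with the given angle."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [c * x + s * y for x, y in points]

def mean_and_ss(values):
    """Sample mean and sum of squared deviations from it."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values)

def fisher_ratio(y1, y2):
    """Squared mean difference over the pooled within-group variance."""
    m1, ss1 = mean_and_ss(y1)
    m2, ss2 = mean_and_ss(y2)
    pooled = (ss1 + ss2) / (len(y1) + len(y2) - 2)
    return (m1 - m2) ** 2 / pooled

def transvariation_probability(y1, y2):
    """Discrete transvariation probability, counted with a sorted copy of y2."""
    ys = sorted(y2)
    if median(y1) >= median(y2):
        count = sum(len(ys) - bisect_right(ys, v) for v in y1)  # pairs y1 < y2
    else:
        count = sum(bisect_left(ys, v) for v in y1)             # pairs y1 > y2
    return count / (len(y1) * len(y2) / 2.0)

def scan(group1, group2):
    """Evaluate both criteria at one-degree steps from 0 to 179 degrees."""
    fisher, trans = [], []
    for angle in range(180):
        y1, y2 = project(group1, angle), project(group2, angle)
        fisher.append(fisher_ratio(y1, y2))
        trans.append(transvariation_probability(y1, y2))
    return fisher, trans
```

With two samples drawn from bivariate normal populations with identity
covariance and means (2, 1) and (0, 0), the angle maximising `fisher` should lie
near 27 degrees, and the transvariation probability at that angle should be
close to its minimum over the scan.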

In different distributional conditions, however, minimizing the transvariation
probability, the transvariation intensity or the transvariation area may give
profoundly different results in terms of misclassification rates. In particular,
not being affected by the overfitting effect already discussed for Posse’s
procedure, allocation rules based on the transvariation probability and the
transvariation intensity may perform better on a novel set of units. In the
following applications, the discrete transvariation probability (the continuous
one gives the same results but is computationally more time consuming) has
been preferred due to its good robustness properties. In fact, while giving
results similar to those obtained by transvariation probability in the no outlier
case, the performance of transvariation intensity deteriorates heavily when
extreme values are present in the data set (Montanari and Calo, 1998).

When the discriminant function is determined by the numerical minimization of
transvariation probability, a coherent classification rule is then given by
projecting the unit to be classified on the obtained discriminant direction and
allocating it to the group with which its projection, treated as the constant
$c$ of Section 2, transvariates most, that is to the group for which the
transvariation probability with respect to $c$ is maximum.
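
This classification rule can be sketched as follows. The sketch assumes that the
median is the average value, that the maximum number of transvariations between
a group and a constant is $n_k / 2$, and that ties between groups are resolved
in favour of the first group; the function names are hypothetical.

```python
from statistics import median

def transvariation_with_constant(y, c):
    """Discrete transvariation probability between the group values y and c."""
    # A median equal to c is treated as a positive reference sign (assumption).
    reference = 1 if median(y) >= c else -1
    s = sum(1 for v in y if (v - c) * reference < 0)
    return s / (len(y) / 2.0)

def allocate(x, groups, a):
    """Project x on direction a and return the (1-based) label of the group
    with which the projected value transvariates most."""
    c = sum(ai * xi for ai, xi in zip(a, x))
    scores = []
    for group in groups:
        y = [sum(ai * gi for ai, gi in zip(a, g)) for g in group]
        scores.append(transvariation_with_constant(y, c))
    return scores.index(max(scores)) + 1  # first group wins ties
```

A projected value lying inside the range of one group and outside the range of
the other transvariates only with the former, and is therefore assigned to it.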



4. Example and conclusions 

The performance of our transvariation based discrimination procedure
(resorting to simulated annealing as the optimization algorithm) has been
tested on many simulated data sets and compared to that of Fisher’s and
Posse’s methods (a GAUSS version of Posse’s original FORTRAN algorithm
has been employed). Here we present the results obtained on 50 replications of
four simulation experiments involving two normal heteroscedastic six-variate
populations (as described in Posse, 1992), with training sets of size
$n_1 = n_2 = 150$ and $n_1 = 100$, $n_2 = 200$, and two lognormal four-variate
populations (derived from Gaussian densities centered in (0,0,0,0) and (2,0,0,0)
and with identity covariance matrix) with the same training set sizes as the
previous experiments.

In the lognormal case most projections show skewed training sample
distributions. This, in principle, would require determining, for each
projection, the maximum attainable value for the number of transvariations. As
no closed form exists for it, this value should be computed numerically, thus
heavily slowing the procedure, which would involve a further numeric step in the
minimization phase. Furthermore this additional computational load might not
yield a remarkable gain in the determination of the best discriminant direction;
as Gini himself affirms, “in practical cases, assuming the upper limit of
transvariation probability to be 1 is not far from truth”. For this reason we
have chosen to minimize the transvariation probability in the form suggested for
symmetric distributions; this amounts to looking for those linear projections
along which the absolute number of transvariations is minimum.

Table 1 reports the results of the simulation experiments as the mean probability 
of misclassification, and the standard deviation of the probability estimates (in 
brackets), for each rule in the different cases, evaluated on a test set of 1000 
individuals for each population. 



Table 1: Results of four simulation experiments

                        heteroscedastic normal                 lognormal
LDF                     n1=n2=150        n1=100, n2=200        n1=n2=150        n1=100, n2=200
Fisher                  0.149 (0.00966)  0.163 (0.01213)       0.254 (0.02016)  0.258 (0.02585)
Posse                   0.143 (0.01643)  0.165 (0.01934)       0.243 (0.01787)  0.271 (0.01888)
Transvariation based    0.158 (0.01182)  0.161 (0.01152)       0.165 (0.01023)  0.165 (0.01074)



The simulation studies have shown that in the heteroscedastic normal case 
transvariation based discrimination, Fisher’s function and Posse’s one yield 
similar error rates (evaluated on a test set); while in the non normal case - that is 
when non parametric methods become necessary - transvariation based 
discrimination largely outperforms both of the other methods. 

Abstract: In this paper, we propose a new clustering method and a new
discriminant rule valid on the basic space $\mathbb{R}^d$. These procedures make
use of a new concept in clustering: the concept of closed and connex (i.e.
connected) forms. This hypothesis generalizes the convex hypothesis and is very
useful to find non convex natural clusters. Finally, we examine the
admissibility of our procedure, in the sense of Fisher and Van Ness [4].

1. Introduction

In the framework of research on clustering by partition, no method can be
defined as ‘the best clustering method’. We can however easily see that many of
them are based on a distance. To avoid the arbitrary choice of distances,
similarities and dissimilarities always posed in classification problems, we use
in this paper a statistical method based on the Poisson point process. So, the
Lebesgue measure, which is both the canonical measure of the classification
space and the measure induced by the Stationary Poisson point process ("the
natural model for points distributed at random" (Fisher (1972) [5], Cressie
(1991) [3],...)), will be considered here.

The starting point of this article is a reminder of a method based on the
Stationary Poisson Point Process and on the Lebesgue measure: the Hypervolume
method, proposed by Hardy and Rasson (1982) [6]. This method, assuming that the
number of clusters (let us say k) is fixed beforehand, is based on the
assumption of k convex clusters. The concept of clustering models for convex
clusters has been analysed by Bock (1997) [2]. By its definition, the
Hypervolume method is not able to find natural non convex clusters. So, we
judged the hypothesis of convexity too strong, and desired a hypothesis which
would also allow us to find natural non convex clusters. We then used the notion
of connex and closed forms to solve this problem. The method based on this new
hypothesis is developed here: like the Hypervolume method, it is based on the
Stationary Poisson Point Process and on the Lebesgue measure and requires the
number of clusters (k) to be known beforehand. We also present the corresponding
discriminant rule (after recalling the one corresponding to the convex case) and
test the procedure against the Fisher and Van Ness conditions (1971) [4] which
"eliminate obviously bad clustering algorithms". We restrict ourselves in this
paper to the 2-dimensional case, but extension to $\mathbb{R}^d$ is
straightforward.



2. Replacing the hypothesis of convexity 

2.1. The Hypervolume criterion 

The reader is referred to Hardy and Rasson (1982) [6] who proposed the 
criterion. For details about the Point Processes theory, see Karr (1991) [7] or 
Krickeberg (1982) [8]. 

Suppose that we observe $n$ data points $\{x_1, x_2, \dots, x_n\}$ inside some
unknown domain $D \subset \mathbb{R}^d$. The goal is to find the natural classes
present in the data, assuming that their number is $k$ ($k$ is fixed
beforehand). We consider that the $n$ observations to classify
$\{x_1, x_2, \dots, x_n\}$ are a realization of a stationary Poisson point
process in a domain $D$ of $\mathbb{R}^d$ which is the union of $k$ disjoint
CONVEX compact subdomains $D_i$ ($i = 1, \dots, k$) (with $D_i$ unknown for all
$i$). Applying the maximum likelihood method, the Hypervolume method consists in
finding the partition which minimizes the sum of the Lebesgue measures of the
disjoint convex hulls of the points in the classes.
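
In two dimensions the criterion is a sum of convex hull areas, which can be
sketched as follows. This is a minimal illustration, not the authors'
implementation: it uses the standard monotone chain hull and the shoelace
formula, does not check that the hulls of a candidate partition are disjoint,
and uses hypothetical function names.

```python
def cross(o, a, b):
    """Cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Vertices of the convex hull (monotone chain), counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Area of a simple polygon by the shoelace formula."""
    n = len(vertices)
    if n < 3:
        return 0.0
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def hypervolume_criterion(partition):
    """Sum of the convex hull areas of the classes of a partition.

    Points are (x, y) tuples; disjointness of the hulls is NOT checked here.
    """
    return sum(polygon_area(convex_hull(cluster)) for cluster in partition)
```

Comparing the value of `hypervolume_criterion` across candidate partitions into
k classes, and keeping the smallest one, mirrors the minimisation described
above.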



2.2. The closed hypothesis 

The hypothesis of convexity used in the hypervolume method will be generalized
here thanks to the notions of CLOSINGS and CLOSED sets, defined in Schmitt
and Mattioli (1993) [11], Matheron (1975) [9] or Stoyan and Stoyan (1994)
[14].

The (morphological) closing of a set $X$ by a set $B$ is defined as the
complement of the set of all points of $\mathbb{R}^2$ covered by the translates
of $B$ that are inside $X^c$. We restrict ourselves to closings where $B$ is a
DISC of radius $r > 0$. We will talk about the closing of radius $r$.

We also include in the term closing by a disc the case of a closing of infinite
radius, which is the same as the convex hull.

A compact set which is the same as its closing by a disc of some radius $r$ is
called closed. Let us now use the last definition in cluster analysis:

We still suppose that we observe $n$ data points $\{x_1, x_2, \dots, x_n\}$
inside $D$ ($D$ unknown). We consider that the $n$ observations are a
realization of a stationary Poisson point process in $D$ which is the union of
$k$ disjoint CONNEX CLOSED compact domains $D_i$ ($i = 1, \dots, k$). (We recall
here that $k$ is still fixed beforehand.)

The fact that the domains are chosen to be connex and closed can be explained : 
a domain must of course be connex to be considered as only one piece. 
However, the hypothesis of connexity is not sufficient : if we search for domains 
of minimal Lebesgue measure, we can indeed easily find connex domains which 
contain all the points and such that the sum of their measures is zero. Trivial 
cases like these must of course be avoided. 

So, we add the closed hypothesis to the connex hypothesis. There are indeed
closed forms which are not connex. As domains must be connex, we cannot
take only the hypothesis of closed domains and we have to require domains to be
connex and closed.

This hypothesis (connex + closed domains) generalizes the hypothesis of
convexity because convex sets are closed (a convex set is equal to each of its
closings by discs) and of course connex, and because the set of closed forms is
huge compared to the set of convex forms.

We have therefore obtained a hypothesis which gives no trivial cases and is less
restrictive than convexity.

Statistical solution : 

The aim is to estimate the $k$ subsets $D_1, \dots, D_k$.

With our hypothesis, the points will be independently and uniformly distributed
on $D$ and the likelihood function will take the form:

$$L(D; x_1, \dots, x_n) = \prod_{i=1}^{n} \frac{1_D(x_i)}{\lambda_d(D)} = \frac{1}{(\lambda_d(D))^n} \prod_{i=1}^{n} 1_D(x_i)$$

where $1_D$ is the indicator function of $D$ and
$\lambda_d(D) = \sum_{i=1}^{k} \lambda_d(D_i)$ is the $d$-dimensional Lebesgue
measure of $D$.

So, applying the maximum likelihood method, the domain $D$ for which the
likelihood is maximal is, among all those which contain all the points, the one
whose Lebesgue measure is minimal.

Let us now define the optimal closing of a set of points as the connex closing 
(by a disc) whose Lebesgue measure is minimal. 

Thanks to our closed and connex hypothesis, for each partition of the set of
points into k subdomains having DISJOINT closings, the likelihood has a local
maximum attained with the optimal closings of the k subsets.

The global maximum will then be attained with the partition which minimizes the
sum of the Lebesgue measures of the disjoint optimal closings of the points in
the k subgroups. This is the solution we seek.

We can see 2 examples of optimal partitions in figure 1 and figure 2, showing 
the ability of the method to find non convex natural clusters. 

Figures 1-2: Two examples of optimal partitions (which minimize the sum of the 
area of the disjoint optimal closings in the k clusters), with k=2. 





3. Corresponding discriminant rules

3.1. Based on the convexity

The details of the discriminant rule presented here can be found in Beaufays and
Rasson (1985) [1]. The allocation rule is:

Assign $x$ to the $i$-th population if and only if the difference between the
area of the convex hulls of the labeled sample of the $i$-th population with and
without the individual to be assigned is minimal.
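
The convex allocation rule can be sketched in the plane as follows. This is a
minimal illustration with hypothetical function names: hull areas are computed
with a monotone chain hull and the shoelace formula, ties are resolved in favour
of the first population, and no check is made that the enlarged hull stays
disjoint from the other hulls.

```python
def cross(o, a, b):
    """Cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hull_area(points):
    """Area of the convex hull of a list of (x, y) tuples."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    hull = lower[:-1] + upper[:-1]
    # Shoelace formula on the hull vertices
    total = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def area_increase(sample, x):
    """Hull area of the labeled sample with x minus the hull area without x."""
    return hull_area(list(sample) + [x]) - hull_area(sample)

def allocate(x, samples):
    """Index (0-based) of the population whose hull area increases least."""
    increases = [area_increase(sample, x) for sample in samples]
    return increases.index(min(increases))
```

A point falling inside the hull of a labeled sample leaves its area unchanged,
so it is assigned to that population.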



3.2. Based on the closings 

We now base our discriminant rule on the closed and connex model. 



So, we suppose that the $n$ observations in the training sets are a realization
of a stationary Poisson point process in a domain $D$ of $\mathbb{R}^2$ which is
the union of $k$ disjoint CLOSED and CONNEX compact subdomains $D_i$
($i = 1, \dots, k$). We want to assign an individual $x$ to one of the $k$
populations $D_i$.

We follow here the same idea as the one presented before: we judge the
hypothesis of convexity too strong. So, we examine the case of closed and
connex domains instead of convex domains as in 3.1.

The conditional distribution for the $i$-th population is assumed to be uniform
in the closed and connex domain $D_i$ and the a priori probability $p_i$ that an
individual belongs to population $i$ is proportional to the Lebesgue measure of
$D_i$.

The density of population $i$, $f_i(x)$, and the unconditional density, $f(x)$,
are respectively equal to:

$$f_i(x) = \frac{1_{D_i}(x)}{\lambda_2(D_i)} \quad \text{and} \quad f(x) = \sum_{i=1}^{k} p_i f_i(x) = \frac{1_D(x)}{\lambda_2(D)}$$

The decision rule is the one given by the Bayesian rule, with the unknown
parameters - the closed and connex sets $D_i$ - replaced by their maximum
likelihood estimates.

Let $X_i$ be the labeled sample of population $i$, $F(X_i)$ its optimal closing
and $x$ the individual to be assigned to one of the $k$ populations.

If $x$ is allocated to the $i$-th group, the maximum likelihood estimates of the
domains $D_j$ are, thanks to our closed and connex hypothesis:

$$\hat D_j = \begin{cases} F(X_i \cup \{x\}) & \text{if } j = i \\ F(X_j) & \text{if } j \neq i \end{cases}$$

The maximum likelihood estimate of $p_i f_i(x)$ is therefore:

$$\widehat{p_i f_i}(x) = \frac{1}{\sum_{j=1}^{k} \lambda_2(F(X_j)) + S_i(x)}$$

where $S_i(x) = \lambda_2(F(X_i \cup \{x\})) - \lambda_2(F(X_i))$.

As the closed and connex sets are assumed disjoint, $x$ cannot be allocated to
the $i$-th group unless $F(X_i \cup \{x\})$ and $F(X_j)$ are disjoint for all
$j \neq i$. Otherwise, we define $S_i(x) = +\infty$.

The allocation rule is:

Assign $x$ to the $i$-th population if and only if $p_i f_i(x) > p_j f_j(x)$ $\forall j \neq i$,

which comes to

Assign $x$ to the $i$-th population if and only if $S_i(x) < S_j(x)$ $\forall j \neq i$,

namely

Assign $x$ to the $i$-th population if and only if the difference between the
area of the OPTIMAL CLOSINGS of the labeled sample of the $i$-th population with
and without the individual to be assigned is minimal.



Several cases can arise with this allocation rule. The optimal closings of the
population with and without the individual to be assigned can have the same
radius, but they can also have different radii.



Figure 3: Illustration of the difference between the areas of optimal closings of
the same radius used in the allocation rule based on the closings.

4. Admissibility of our procedure

The problem of choosing a clustering procedure among the myriad proposed is
perplexing. The approach adopted here is the one suggested by L. Fisher and
J.W. Van Ness (1971) [4]. As our procedure is based on Lebesgue measure and
not on a distance, some problems will arise when the notion of distance is used
in their admissibility conditions. So, we will divide the 9 conditions into 3
groups.

Among the conditions directly applicable to $\mathbb{R}^d$, we can prove that
our procedure is point proportion, cluster proportion (these 2 conditions are
trivially fulfilled by our procedure) and not cluster omission admissible.
Moreover, it is not convex admissible, which is not surprising because our aim
was precisely to be able to find non convex natural clusters.

For the conditions expressed in terms of distance, we can only test admissibility
on $\mathbb{R}$: our procedure is image, k-groups and monotone admissible on
$\mathbb{R}$. These conditions can however be generalized to $\mathbb{R}^d$. The
last condition (well-structured exact tree admissibility) is only applicable to
hierarchical methods and thus cannot be fulfilled by our procedure.

Van Ness (1973) [15] added admissibility conditions to the preceding ones.
Among them, we will point out one which is only applicable to clustering
procedures with corresponding discriminant analysis procedures. Our procedure
was proved to be repeatable admissible.



5. Conclusion

Using closings in both cluster analysis and discriminant analysis appears promising. Since the closed and connected hypothesis generalizes the convex hypothesis, we were able to create a clustering method and a discriminant analysis method generalizing the Hypervolume method and the corresponding discriminant rule. Our aim of replacing convexity by a less restrictive hypothesis was thus attained. Moreover, many of the Fisher and Van Ness admissibility conditions are fulfilled by our procedure, which therefore cannot be considered a bad clustering algorithm. Unfortunately, for the clustering method created, an algorithm that searches for the best partition among all possible partitions is not practical. Research is therefore needed, and is under way, to avoid considering every partition. As the use of closed forms in clustering is largely unexplored, many research directions are being examined, for example the use of alpha-shapes in clustering instead of closings. The method can also be improved, for instance by generalizing it to d dimensions or by using closings by squares or by any convex set instead of discs.

1. Introduction

The application of artificial neural networks (NNs) in pattern recognition often refers to problems of discrimination, i.e. the assignment of objects to more or less specified classes (supervised learning). In contrast, the present paper deals with clustering problems (unsupervised learning). We consider a set $\mathcal{O} = \{1, \ldots, n\}$ of $n$ objects described by multidimensional data points $x_1, \ldots, x_n \in \mathbb{R}^p$ or by an $n \times n$ matrix $D = (d_{kl})$ of pairwise dissimilarities. From these data we want to construct an appropriate partition $\mathcal{C} = (C_1, \ldots, C_m)$ of $\mathcal{O}$ with 'homogeneous' classes $C_1, \ldots, C_m$, suitable class representatives $z_1, \ldots, z_m \in \mathbb{R}^p$ and, possibly, a corresponding subdivision $\mathcal{B} = (B_1, \ldots, B_m)$ of the whole feature space $\mathbb{R}^p$ (in order to classify further data points). We show how problems of this type can be solved with neural networks and, conversely, how classical clustering methods are incorporated into the neural network approach (see also Murtagh 1996).

Section 2 surveys the classical k-means clustering approach and describes a general stochastic approximation method for clustering problems. Section 3 deals with a generalized Kohonen algorithm for constructing 'self-organizing maps' (SOMs; Anouar et al. 1996, 1997) and shows its relationship to the previous clustering methods. We propose a new 'continuous' clustering criterion (K-criterion) which reveals and characterizes the asymptotic performance of Kohonen maps. We suggest some generalized SOMs along the lines of classical cluster analysis. Section 4 is devoted to Hopfield networks: we discuss their use for constructing a fuzzy classification of objects and recall some related classical 'gravitational' clustering methods. In Section 5 we describe how multi-layer perceptrons (MLPs) can be used for clustering purposes.

2. k-means Clustering, Stochastic Approximation and Asymptotics 

In this section we briefly recall the ideas of k-means clustering and stochastic approximation in order to combine them in Kohonen's algorithm in Section 3. We start with the problem of partitioning the set $\mathcal{O}$ of objects with data points $x_1, \ldots, x_n \in \mathbb{R}^p$ into a given number $m$ of 'homogeneous' classes $C_1, \ldots, C_m$ by minimizing the well-known variance or SSQ clustering criterion

$$g_n(\mathcal{C}) := \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \|x_k - \bar{x}_{C_i}\|^2 \to \min_{\mathcal{C}} \qquad (2.1)$$

with respect to all m-partitions $\mathcal{C} = (C_1, \ldots, C_m)$. It is well-known (cf. Bock 1974) that this is equivalent to each of the two following problems:

$$g_n(\mathcal{C}, Z) := \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \|x_k - z_i\|^2 \to \min_{\mathcal{C}, Z} =: g_n^* \qquad (2.2)$$

$$\gamma_n(Z) := \frac{1}{n} \sum_{k=1}^{n} \min_{1 \le j \le m} \|x_k - z_j\|^2 \to \min_{Z} \qquad (2.3)$$

where minimization is over all m-partitions $\mathcal{C}$ and all m-tuples $Z = (z_1, \ldots, z_m)$ of 'class centers' $z_1, \ldots, z_m \in \mathbb{R}^p$. In fact, any solution $\mathcal{C}^*, Z^*$ of one problem yields a solution of the other problem(s) due to the stationarity equations $z_i^* = \bar{x}_{C_i^*}$ (centroids) and $C_i^* = \{ k \in \mathcal{O} : \|x_k - z_i^*\| = \min_{j=1,\ldots,m} \|x_k - z_j^*\| \}$, $i = 1, \ldots, m$ (minimum-distance partition, Voronoi classes).

There are essentially two types of iterative algorithms for solving (approximately) those optimization problems: (a) methods which proceed batch-wise insofar as all $n$ data points are processed or reclassified at the same time (see the k-means algorithm (i) below), and (b) methods which proceed sequentially, i.e., the data points $x_1, x_2, \ldots$ are presented one after the other, and the m-partition $\mathcal{C}^{(n)}$ and the center system $Z^{(n)}$ obtained from the first $n$ data $x_1, \ldots, x_n$ are suitably updated after observing $x_{n+1}$. It is this second type which resembles the learning algorithm paradigm in the NN context and includes, in particular, the method (ii) of MacQueen (1967) and the stochastic approximation method (iii) by Braverman (1966) and Tsypkin & Kelmans (1967). The clustering methods (i), (ii), (iii) can be described as follows:

(i) k-means algorithm:

With a fixed number $n$ of data points and an arbitrary initial m-partition $\mathcal{C}^{(0)}$ we minimize the criterion $g_n(\mathcal{C}, Z)$, (2.2), with respect to $\mathcal{C}$ and $Z$ in turn and obtain, for $t = 0, 1, \ldots$, a sequence of center systems $Z^{(t)}$ (= centroids of $\mathcal{C}^{(t)}$) and m-partitions $\mathcal{C}^{(t)}$ (= minimum-distance partition for $Z^{(t-1)}$) which improve steadily on the criteria (2.1) to (2.3):

$$g_n(\mathcal{C}^{(t)}) = g_n(\mathcal{C}^{(t)}, Z^{(t)}) \ge g_n(\mathcal{C}^{(t+1)}, Z^{(t)}) \ge g_n(\mathcal{C}^{(t+1)}, Z^{(t+1)}) = g_n(\mathcal{C}^{(t+1)}) \qquad (2.4)$$

until we end with a stationary value of $g_n$ which is (hopefully) close or identical to the global minimum $g_n^*$.
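The alternating minimization in (i) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the author's implementation: the function names (`k_means`, `ssq`, `nearest`), the representation of a partition as a list of class labels, and the rule that an emptied class keeps its previous center are assumptions made for the example.

```python
def sq_dist(x, z):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def nearest(x, centers):
    """Index of the center closest to x (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))


def ssq(data, labels, centers):
    """Criterion g_n(C, Z) of (2.2): mean squared distance to the class center."""
    return sum(sq_dist(x, centers[l]) for x, l in zip(data, labels)) / len(data)


def k_means(data, labels, m, max_iter=100):
    """Batch k-means started from an initial partition with m non-empty classes."""
    labels = list(labels)
    centers = [centroid([x for x, l in zip(data, labels) if l == i])
               for i in range(m)]
    history = [ssq(data, labels, centers)]
    for _ in range(max_iter):
        new_labels = [nearest(x, centers) for x in data]  # minimize over C
        if new_labels == labels:
            break                                         # stationary partition
        labels = new_labels
        for i in range(m):                                # minimize over Z
            members = [x for x, l in zip(data, labels) if l == i]
            if members:                                   # empty class keeps its center
                centers[i] = centroid(members)
        history.append(ssq(data, labels, centers))
    return labels, centers, history


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers, history = k_means(data, [0, 1, 0, 1, 0, 1], 2)
```

The list `history` records the criterion value after each full sweep, so the chain of inequalities (2.4) shows up as a non-increasing sequence.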

(ii) MacQueen’s sequential procedure: 

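Begin with $n = m$ data points and $m$ singleton classes $C_i^{(m)} := \{i\}$ for $i = 1, \ldots, m$. Then, for $n \ge m$, if $\mathcal{C}^{(n)}$ is the current m-partition of $x_1, \ldots, x_n$, the next observation $x_{n+1}$ is assigned to the class $C_{i^*}^{(n)}$ with minimum distance $\|x_{n+1} - \bar{x}_{C_{i^*}^{(n)}}\|$ (= competitive learning), thus obtaining an updated partition $\mathcal{C}^{(n+1)}$ with classes $C_i^{(n+1)} := C_i^{(n)} \cup \{n+1\}$ for $i = i^*$ and $C_i^{(n+1)} := C_i^{(n)}$ for $i \neq i^*$. Similarly, the class centroids $z_i^{(n)} := \bar{x}_{C_i^{(n)}}$ are updated according to

$$z_i^{(n+1)} = \bar{x}_{C_i^{(n+1)}} = \begin{cases} z_i^{(n)} + \alpha_{in} \cdot (x_{n+1} - z_i^{(n)}) & \text{for } i = i^* \\ z_i^{(n)} & \text{for } i \neq i^* \end{cases}$$

where the 'learning rate' $\alpha_{in} := 1/(|C_i^{(n)}| + 1)$ is obviously data dependent.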
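As a rough illustration of the sequential procedure (ii), the sketch below processes the observations one by one and uses the data-dependent learning rate 1/(class size + 1), which keeps every center equal to the centroid of its class. The function name `macqueen` and the list-based data layout are assumptions of this example.

```python
def macqueen(data, m):
    """Sequential clustering: the first m points start m singleton classes."""
    centers = [list(x) for x in data[:m]]
    sizes = [1] * m
    labels = list(range(m))
    for x in data[m:]:
        # competitive step: the class with the nearest centroid wins
        i_star = min(
            range(m),
            key=lambda i: sum((a - z) ** 2 for a, z in zip(x, centers[i])),
        )
        alpha = 1.0 / (sizes[i_star] + 1)  # data-dependent learning rate
        centers[i_star] = [z + alpha * (a - z)
                           for a, z in zip(x, centers[i_star])]
        sizes[i_star] += 1
        labels.append(i_star)
    return labels, centers, sizes


labels, centers, sizes = macqueen([(0.0,), (10.0,), (1.0,), (11.0,), (2.0,)], 2)
```

In this small run the first class collects the points 0, 1 and 2, and its center is their mean, as expected from the centroid property of the update.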

(iii) Stochastic approximation: 

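This algorithm concentrates on the sequence $Z^{(n)}$ of class representatives.

It begins, for $n = m$, with $m$ distinct points $z_i^{(m)} := x_i$ ($i = 1, \ldots, m$) and updates, for $n \ge m$ and after observing $x_{n+1}$, the current system $Z^{(n)} = (z_1^{(n)}, \ldots, z_m^{(n)})$ according to the recursive formula

$$z_i^{(n+1)} = \begin{cases} z_i^{(n)} + \alpha_{in} \cdot (x_{n+1} - z_i^{(n)}) & \text{for } i = i^* \\ z_i^{(n)} & \text{for } i \neq i^*, \end{cases}$$

now with $m$ predetermined, possibly class-specific sequences $(\alpha_{in})_{n \in \mathbb{N}}$ of 'learning factors' with $\lim_{n \to \infty} \alpha_{in} = 0$, $\sum_n \alpha_{in} = \infty$ and $\sum_n \alpha_{in}^2 < \infty$ (such as, e.g., $\alpha_{in} = 1/n$). A suitable sequence of m-partitions $\mathcal{C}^{(n)}$ results if we define $\mathcal{C}^{(n)}$ as the minimum-distance partition of $\{x_1, \ldots, x_n\}$ generated by the center system $Z^{(n)}$.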
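The stochastic approximation scheme (iii) differs from MacQueen's procedure only in that the learning factors are fixed in advance. A minimal sketch follows, assuming the schedule 1/n mentioned in the text as the default; the function name and the `rate(i, n)` callback are hypothetical conveniences for this example.

```python
def stochastic_approximation(data, m, rate=lambda i, n: 1.0 / n):
    """Update only the winning center with a predetermined learning factor.

    rate(i, n) is the factor for class i after n observations have been seen.
    """
    centers = [list(x) for x in data[:m]]
    for n in range(m, len(data)):
        x = data[n]                     # this is x_{n+1} in the text's notation
        i_star = min(
            range(m),
            key=lambda i: sum((a - z) ** 2 for a, z in zip(x, centers[i])),
        )
        a_n = rate(i_star, n)
        centers[i_star] = [z + a_n * (c - z)
                           for c, z in zip(x, centers[i_star])]
    return centers


centers = stochastic_approximation([(0.0,), (10.0,), (1.0,), (11.0,)], 2)
```

Because the factor no longer depends on the class sizes, the centers are in general not the class centroids at finite n; only the limiting behaviour is comparable.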



The performance and asymptotic behaviour of all three methods have been investigated under the assumption that the data are a sequence $X_1, X_2, \ldots$ of i.i.d. random vectors with Lebesgue density $f(x)$ on $\mathbb{R}^p$ (e.g., a mixture of $m$ $p$-dimensional normals). The results refer to the solution $\mathcal{B}^*, Z^*$ of the following continuous analogue of the clustering criterion (2.2):

$$g(\mathcal{B}, Z) := \sum_{i=1}^{m} \int_{B_i} \|x - z_i\|^2 \, f(x)\, dx \to \min_{\mathcal{B}, Z} =: g^* \qquad (2.6)$$

where minimization is over all m-partitions $\mathcal{B} = (B_1, \ldots, B_m)$ of $\mathbb{R}^p$ and all center systems $Z = (z_1, \ldots, z_m)$. Note that optimum configurations $\mathcal{B}^*, Z^*$ will typically be related by similar stationarity conditions as before (possibly after a relabeling of class indices): $z_i^* = E_f[X \mid X \in B_i^*]$ is the conditional expectation of $X$ in the domain $B_i^*$ (under $f$), and $B_i^* = \{ x \in \mathbb{R}^p : \|x - z_i^*\| = \min_j \|x - z_j^*\| \}$ is the Voronoi region in $\mathbb{R}^p$ of the centroid $z_i^*$. With this notation we obtain, for a sufficiently regular density $f$:

Theorem 1: 

(i) Denote by $\mathcal{C}^{*(n)}, Z^{*(n)}$ an optimum configuration for the classical SSQ criteria (2.1), (2.2), (2.3) with a sample size $n$, and by $g_n^* = g_n(\mathcal{C}^{*(n)}, Z^{*(n)}) = g_n(\mathcal{C}^{*(n)}) = \gamma_n(Z^{*(n)})$ the minimum criterion value. If the optimum m-partition $\mathcal{B}^*$ of the continuous criterion (2.6) is unique, then for $n \to \infty$ and $i = 1, \ldots, m$:

$$z_i^{*(n)} \to z_i^* = E[X \mid X \in B_i^*] \qquad (2.7)$$

$$g_n^* := \min_{\mathcal{C}, Z} g_n(\mathcal{C}, Z) \to g^* := \min_{\mathcal{B}, Z} g(\mathcal{B}, Z) = g(\mathcal{B}^*, Z^*) \qquad (2.8)$$

in probability and almost surely (Pollard 1981, 1982, see also Bock 1985, 1996a).

(ii), (iii): Under some regularity conditions for $f$ analogous limits are obtained for the centers $z_i^{(n)}$ resulting from the sequential algorithms (ii), (iii) when $n \to \infty$:

$$z_i^{(n)} \to z_i^* = E[X \mid X \in B_i^*] \qquad (2.9)$$

$$g_n := g_n(\mathcal{C}^{(n)}, Z^{(n)}) \to g^* := g(\mathcal{B}^*, Z^*) \qquad (2.10)$$

in probability and almost surely, where $\mathcal{B}^*, Z^*$ is a stationary pair (local minimum) for $g(\mathcal{B}, Z)$ (and might sometimes be the unique global optimum).

This theorem underlines that all three algorithms implicitly solve the same continuous partitioning problem (2.6). Details for (ii) are given by MacQueen (1967); for (iii) see Braverman (1966), Tsypkin & Kelmans (1967) and Bock (1974, Chap. 29).

As a matter of fact, the results of Braverman pertain even to generalized partitioning criteria such as

$$G(\mathcal{B}, Z) := \sum_{i=1}^{m} \int_{B_i} \phi_i(x; z_1, \ldots, z_m) \, f(x)\, dx \to \min_{\mathcal{B}, Z} \qquad (2.11)$$

where the squared Euclidean distance $\|x - z_i\|^2$ is replaced by a generalized distance function $\phi_i(x; Z) = \phi_i(x; z_1, \ldots, z_m)$, e.g., by the $L_p$-distance $\|x - z_i\|_p$ (for some $p \ge 1$) or even a class-specific distance which depends on all centers $z_1, \ldots, z_m$ as we will see in (3.2), (3.3), (3.4) below. Tsypkin & Kelmans (1967) derive stationarity conditions for an optimal pair $(\mathcal{B}, Z)$: whereas the optimum centers $z_i \in \mathbb{R}^p$ must fulfill the 'normal equations'

$$\sum_{t=1}^{m} \int_{B_t} \nabla_{z_i} \phi_t(x; z_1, \ldots, z_m) \, f(x)\, dx = 0, \qquad i = 1, \ldots, m, \qquad (2.12)$$

the optimum classes $B_i \subset \mathbb{R}^p$ are the Voronoi regions belonging to the $z_i$ (using the metric $\phi_i(x, Z)$).

In analogy to the SSQ case (2.2), we can formulate a related finite-sample clustering problem by:

$$G_n(\mathcal{C}, Z) := \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \phi_i(x_k; z_1, \ldots, z_m) \to \min_{\mathcal{C}, Z}. \qquad (2.13)$$

Even if we can easily design a k-means-type algorithm for $G_n(\mathcal{C}, Z)$, the minimization with respect to $Z$ and the solution of (the empirical version of) (2.12) can be computationally difficult. On the other hand, we can easily formulate the stochastic approximation algorithm for minimizing $G$: After observing $x_{n+1}$ we update the current center system $Z^{(n)}$ as follows:

$$z_i^{(n+1)} = z_i^{(n)} - \alpha_{in} \cdot \nabla_{z_i} \phi_{i^*}(x_{n+1}; Z^{(n)}), \qquad i = 1, \ldots, m, \qquad (2.14)$$

with $m$ class-specific learning sequences $(\alpha_{in})_{n \in \mathbb{N}}$. For a dissimilarity of the type $\phi_i(x; Z) = \phi(x - z_i)$ this reduces to:

$$z_i^{(n+1)} = \begin{cases} z_i^{(n)} + \alpha_{in} \cdot \nabla_x \phi(x_{n+1} - z_i^{(n)}) & \text{for } i = i^* \\ z_i^{(n)} & \text{for } i \neq i^* \end{cases}$$

where $i = i^*$ denotes the class with minimum distance $\phi(x_{n+1} - z_i^{(n)})$ (cf. (2.5) for $\phi(x) = \|x\|^2/2$). Similarly as before we obtain (Tsypkin & Kelmans 1967):

Theorem 2: 

If $x_1, x_2, \ldots$ are i.i.d. with a suitably regular density $f$, the sequence of center systems $Z^{(n)}$ obtained from stochastic approximation (2.14) together with the corresponding minimum-distance partitions $\mathcal{B}^{(n)}$ (using the class-specific distances $\phi_i(x, Z^{(n)})$) converge in probability and almost surely to a stationary pair $\mathcal{B}^*, Z^*$ of the criterion $G(\mathcal{B}, Z)$. Similarly, the corresponding criterion values $G^{(n)} := G(\mathcal{B}^{(n)}, Z^{(n)})$ converge to $G^* := G(\mathcal{B}^*, Z^*)$.

In the following section this theorem will be used for characterizing the asymptotic performance of Kohonen maps.



3. Clustering Methods for Classical and Generalized Kohonen Maps 

Clustering methods are implicitly used in the construction of 'self-organizing maps' (SOMs; Kohonen 1982, 1995), a well-known method in the NN framework. We show how it is related to the previous clustering approaches and present various new or recent generalizations of Kohonen's basic proposal.

Kohonen's algorithm starts from a sequence of high-dimensional data vectors $x_1, x_2, \ldots$ in $\mathbb{R}^p$ (learning sample) and tries to find an illustrative display of these points in a (given) rectangular, hexagonal, ... lattice $\mathcal{L}$ of points in the plane $\mathbb{R}^2$ (or in another low-dimensional space) such that the high-dimensional structure of the data (neighbourhoods, topological ordering, clusters) can be easily detected in this diagram. In statistics this problem has been tackled, e.g., by principal component analysis, projection pursuit clustering or multidimensional scaling (see Bock 1997). In contrast, Kohonen has proposed the following sequential strategy (where we consider w.l.o.g. a rectangular $a \times b$ lattice $\mathcal{L} = \{P_i = (r, s) \mid r = 1, \ldots, a;\ s = 1, \ldots, b\}$ of $\mathbb{R}^2$):

(1) For each sample size $n$ we look for a (large) number $m = a \cdot b$ of 'mini-classes' $C_1^{(n)}, \ldots, C_m^{(n)}$ for $\{x_1, \ldots, x_n\}$ which are homogeneous in some SSQ sense, a system of suitable class representatives $z_1^{(n)}, \ldots, z_m^{(n)} \in \mathbb{R}^p$ and the minimum-distance partition $\mathcal{B}^{(n)}$ of $\mathbb{R}^p$ generated by $Z^{(n)}$.

(2) Each class representative $z_i = z_i^{(n)}$ is assigned to one of the vertices $P_i$ of the rectangular lattice $\mathcal{L}$. (Note that neural network terminology calls $P_i$ a 'neuron' and $z_i$ a 'weight vector'.) Ideally this assignment $\psi$ should be such that close or neighbouring class centers $z_i, z_j$ in $\mathbb{R}^p$ are mapped into vertices $P' := \psi(i)$, $P'' := \psi(j)$ which are close or neighbours in the lattice $\mathcal{L}$ as well, i.e. with a small path distance $\delta(P', P'')$. This is attained by the following construction:

(3) Similarly as in (2.5) or (2.14) we define a sequence of center systems $Z^{(n)} = (z_1^{(n)}, \ldots, z_m^{(n)})$ in a sequential way, but with a modified updating formula (3.1) below: We start, for $n = 0$, with $m$ random points $z_1^{(0)}, \ldots, z_m^{(0)}$ from $\mathbb{R}^p$; they are arbitrarily assigned to the lattice points $P_i \in \mathcal{L}$ (e.g. by $\psi(i) = P_i$; this assignment is never changed in the sequel). If now, for $n \ge 0$, a new data vector $x_{n+1}$ is closest to the class center $z_{i^*}^{(n)}$ among all current centers $z_1^{(n)}, \ldots, z_m^{(n)}$, we update $z_{i^*}^{(n)}$ by moving it towards $x_{n+1}$ and simultaneously all other class centers $z_j^{(n)}$ whose counterparts $P_j$ are close to, or neighbours of, $P_{i^*}$ in $\mathcal{L}$. A quite general updating formula is given by:

$$z_j^{(n+1)} = z_j^{(n)} + \alpha_n \cdot K(\delta(P_j, P_{i^*})) \cdot (x_{n+1} - z_j^{(n)}), \qquad j = 1, \ldots, m, \qquad (3.1)$$

where the transformation $K(\delta)$ is typically decreasing from $K(0) = 1$ to 0 and controls the influence of more or less neighbouring vertices $P_j$ of $\mathcal{L}$. Kohonen's original methods are obtained for the threshold function $K(\delta) := 1$ or $0$ if $\delta \le \epsilon$ or $\delta > \epsilon$, respectively (with a given distance threshold $\epsilon > 0$) and a bathtub function $K$ ('Mexican hat') where positive (negative) values of $K$ yield attraction (repulsion) effects in (3.1).
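One step of a neighbourhood-weighted update of this kind can be sketched as follows. The sketch makes several assumptions that are not fixed by the text: the path distance on the rectangular lattice is taken to be the city-block distance between vertex coordinates, a single learning factor `alpha` is used for all centers, and the names `lattice`, `threshold_kernel` and `kohonen_step` are invented for the example.

```python
def lattice(a, b):
    """Vertices P = (r, s) of a rectangular a x b lattice, listed row by row."""
    return [(r, s) for r in range(1, a + 1) for s in range(1, b + 1)]


def path_distance(p, q):
    """Assumed path distance: city-block distance between lattice vertices."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def threshold_kernel(eps):
    """K(delta) = 1 if delta <= eps, else 0."""
    return lambda delta: 1.0 if delta <= eps else 0.0


def kohonen_step(centers, vertices, x, alpha, kernel):
    """Move the winner and its lattice neighbours towards the new point x."""
    i_star = min(
        range(len(centers)),
        key=lambda i: sum((a - z) ** 2 for a, z in zip(x, centers[i])),
    )
    updated = []
    for z, p in zip(centers, vertices):
        k = kernel(path_distance(p, vertices[i_star]))
        updated.append([zc + alpha * k * (xc - zc) for zc, xc in zip(z, x)])
    return i_star, updated


vertices = lattice(1, 3)
winner, centers = kohonen_step(
    [[0.0], [5.0], [10.0]], vertices, [1.0], 0.5, threshold_kernel(1)
)
```

With threshold 1 the winner and its direct lattice neighbour move half-way towards the new point, while the vertex at path distance 2 is left unchanged.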

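In this context Anouar, Badran & Thiria (1997) have defined the following finite sample clustering criterion:

$$\tilde{g}_n(\mathcal{C}, Z) := \frac{1}{n} \sum_{i=1}^{m} \sum_{k \in C_i} \Big[ \sum_{j=1}^{m} K(\delta(P_i, P_j)) \cdot \|x_k - z_j\|^2 \Big] \to \min_{\mathcal{C}, Z} =: \tilde{g}_n^* \qquad (3.2)$$

where the bracket

$$\phi_i(x; Z) := \sum_{j=1}^{m} K(\delta(P_i, P_j)) \cdot \|x - z_j\|^2 \qquad (3.3)$$

defines a dissimilarity between a data point $x \in \mathbb{R}^p$ and the vertex $P_i$ from $\mathcal{L}$ (or the mini-class $C_i$) in which the term $\|x - z_i\|^2$ (i.e., with $j = i$) has maximum weight $K(0) = 1$.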

As a continuous counterpart we define the K-criterion:

$$\tilde{g}(\mathcal{B}, Z) := \sum_{i=1}^{m} \int_{B_i} \Big[ \sum_{j=1}^{m} K(\delta(P_i, P_j)) \, \|x - z_j\|^2 \Big] f(x)\, dx \to \min_{\mathcal{B}, Z} =: \tilde{g}^* \qquad (3.4)$$

which is obviously a special case of the criterion (2.11). Similarly as in Section 2 we can derive for the criteria (3.2) and (3.4) three 'self-organizing' clustering or mapping methods (i*), (ii*), (iii*):

(i*) A finite-sample k-means analogue of Kohonen's method:

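We minimize $\tilde{g}_n(\mathcal{C}, Z)$, (3.2), partially with respect to the center system $Z$ and the m-partition $\mathcal{C}$ in turn: (a) From a current m-partition $\mathcal{C}^{(t)}$ ($t \ge 0$) we obtain the optimum centers:

$$z_i^{(t)} = \sum_{j=1}^{m} w_{ij}^{(t)} \cdot \bar{x}_{C_j^{(t)}}, \qquad i = 1, \ldots, m, \qquad (3.5)$$

as a weighted mean of class centroids where the normalized weights $w_{ij}^{(t)} := |C_j^{(t)}| \cdot K(\delta(P_i, P_j)) / v_i^{(t)}$ with

$$v_i^{(t)} := \sum_{j=1}^{m} |C_j^{(t)}| \cdot K(\delta(P_i, P_j))$$

depend on the current class sizes $|C_j^{(t)}|$. (b) Inversely, for a given center system $Z = Z^{(t)}$, (3.2) is minimized by the minimum-distance partition $\mathcal{C}^{(t+1)}$ with classes

$$C_i^{(t+1)} := \{ k \in \{1, \ldots, n\} \mid i = \arg\min_j \phi_j(x_k, Z^{(t)}) \}, \qquad i = 1, \ldots, m, \qquad (3.6)$$

with the class-specific dissimilarity measure (3.3).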
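The two half-steps of this k-means analogue can be sketched as below, assuming that the neighbourhood weights are supplied as a symmetric m x m matrix `K` with entries K(delta(P_i, P_j)). The function names and the convention that a center with zero total weight keeps its old value are assumptions of this example.

```python
def phi(i, x, centers, K):
    """Class-specific dissimilarity: kernel-weighted sum of squared distances."""
    return sum(
        K[i][j] * sum((a - z) ** 2 for a, z in zip(x, centers[j]))
        for j in range(len(centers))
    )


def optimum_centers(data, labels, K, old_centers):
    """Centers as kernel-weighted means of the class centroids (step a)."""
    m, p = len(K), len(data[0])
    sizes = [0] * m
    sums = [[0.0] * p for _ in range(m)]
    for x, l in zip(data, labels):
        sizes[l] += 1
        for c in range(p):
            sums[l][c] += x[c]
    centers = []
    for i in range(m):
        v = sum(sizes[j] * K[i][j] for j in range(m))
        if v == 0:                       # no weight at all: keep the old center
            centers.append(list(old_centers[i]))
        else:                            # size * centroid equals the class sum
            centers.append([sum(K[i][j] * sums[j][c] for j in range(m)) / v
                            for c in range(p)])
    return centers


def minimum_distance_partition(data, centers, K):
    """Assign each point to the class with the smallest dissimilarity (step b)."""
    return [min(range(len(centers)), key=lambda i: phi(i, x, centers, K))
            for x in data]


data = [(0.0,), (2.0,), (10.0,)]
identity = [[1.0, 0.0], [0.0, 1.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
z_identity = optimum_centers(data, [0, 0, 1], identity, [[0.0], [0.0]])
z_ones = optimum_centers(data, [0, 0, 1], ones, [[0.0], [0.0]])
new_labels = minimum_distance_partition(data, z_identity, identity)
```

With the identity matrix the scheme collapses to ordinary k-means (centers are class centroids), whereas a matrix of ones pulls every center to the overall mean; intermediate kernels interpolate between these two extremes.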

(ii*) An analogue to MacQueen’s sequential algorithm: 

With the notation from (ii), a new data point $x_{n+1}$ is assigned to the class $C_i^{(n)}$ with minimum dissimilarity $\phi_i(x_{n+1}, Z^{(n)})$ from $x_{n+1}$ (for $i = i^*$, say), and all centers are suitably updated:

$$z_i^{(n+1)} = \sum_{j=1}^{m} w_{ij}^{(n+1)} \cdot \bar{x}_{C_j^{(n+1)}}, \qquad i = 1, \ldots, m. \qquad (3.7)$$

(iii*) The stochastic approximation algorithm:

Since the K-criterion (3.4) is a special case of the criterion $G(\mathcal{B}, Z)$, (2.11), we can directly apply Tsypkin & Kelmans' updating formula (2.14). Calculating the gradients in (2.14) yields the generalized Kohonen algorithm (3.1) which therefore proves to be the stochastic approximation device for the new K-criterion (see also Anouar et al. 1997).

The asymptotic behaviour of the three algorithms (i*), (ii*), (iii*) is characterized by Theorem 3 which assumes a sufficiently regular density $f$ for the data $x_1, x_2, \ldots$:

Theorem 3:

(i*): Assume that the optimum m-partition $\mathcal{B}^*$ for the K-criterion $\tilde{g}$, (3.4), is unique and denote by $\mathcal{C}^{*(n)}, Z^{*(n)}$ the optimum configurations for Anouar's SSQ criterion $\tilde{g}_n$, (3.2), for a given sample size $n$, and by $\tilde{g}_n^* = \tilde{g}_n(\mathcal{C}^{*(n)}, Z^{*(n)})$ the minimum criterion value. Then, for $n \to \infty$ and $i = 1, \ldots, m$:

$$z_i^{*(n)} \to z_i^* := \sum_{j=1}^{m} w_{ij}^* \cdot E[X \mid X \in B_j^*] \qquad (3.8)$$

$$\tilde{g}_n^* := \min_{\mathcal{C}, Z} \tilde{g}_n(\mathcal{C}, Z) \to \tilde{g}^* = \min_{\mathcal{B}, Z} \tilde{g}(\mathcal{B}, Z) \qquad (3.9)$$

in probability and a.s. with normalized weights $w_{ij}^* := P(X \in B_j^*) \cdot K(\delta(P_i, P_j)) / w_i^*$ and $w_i^* := \sum_{j=1}^{m} P(X \in B_j^*) \cdot K(\delta(P_i, P_j))$.

(iii*): Similar convergence properties hold for the sequence of center systems $Z^{(n)}$ of the generalized Kohonen algorithm (3.1), i.e. the stochastic approximation method for the K-criterion (3.4):

$$z_i^{(n)} \to z_i^* \qquad (3.10)$$

$$\tilde{g}_n(\mathcal{C}^{(n)}, Z^{(n)}) \to \tilde{g}^* := \tilde{g}(\mathcal{B}^*, Z^*) \qquad (3.11)$$

where $(\mathcal{B}^*, Z^*)$ is a stationary pair for (3.4) and $\mathcal{C}^{(n)}$ the minimum-distance partition of $\{x_1, \ldots, x_n\}$ corresponding to the center system $Z^{(n)}$ (using the distances $\phi_i$).

The importance of Theorem 3 resides in the fact that it characterizes the performance of (original and generalized) 'self-organizing' Kohonen algorithms and can be a key for investigating the much-discussed question whether Kohonen's algorithm produces 'topologically correct' results, e.g., in the case where the underlying density $f(x)$ is concentrated near a low-dimensional manifold of $\mathbb{R}^p$. Related problems are investigated in Cottrell & Fort (1989), Tolat (1990), Bouton & Pages (1993), Fort & Pages (1996) and Ambroise & Govaert (1996).

The criterion-based approach to Kohonen networks suggests various generalizations which are known from classical cluster analysis and may be appropriate in practical situations. Apart from the use of Mahalanobis-type distances and $L_p$-distances we mention models where each class is characterized (a) by a class-specific hyperplane $H_i$ (instead of a single point $z_i$ only) as in principal component clustering (Bock 1974, 1996b,c), see also the approaches by Ritter (1997), Moshou & Ramon (1997), Kohonen et al. (1997) and Bock (1998b); (b) by a local non-linear (e.g., quadratic) surface (cf. Ritter 1997); (c) by regression planes (as in Bock 1996b,c; see Badran et al. 1997) or (d) by a general class-specific distribution density $f(x, \vartheta_i)$ as in maximum-likelihood clustering (Bock 1996b), see also Bock (1998b).

4. Hopfield Networks for Clustering Purposes 

Hopfield networks are designed to minimize a smooth loss function $L(v_1, \ldots, v_N) \ge 0$ with respect to $N$ real-valued variables $v_1, \ldots, v_N$. In the clustering framework $L$ denotes a clustering criterion and $v_1, \ldots, v_N$ describe a classification (e.g., class indicators, class centers, fuzzy memberships etc.). The method proceeds by analogy to the motion of $N$ particles in time $t \ge 0$ which migrate under a system of forces (Hamiltonian differential equations) and finally approach a steady state $(v_1^*, \ldots, v_N^*)$ which provides a (global or local) minimum of the 'potential energy function' $L$.

More specifically, each variable $v_i$ is modeled as a time function $v_i(t) := \sigma(a_i(t))$ (position of the particle $i$, output of the 'neuron' $i$) with an activation function $a_i(t)$ and a prespecified smooth sigmoid function $\sigma$, e.g., the logistic function $\sigma(a) = 1/(1 + e^{-a})$. The activations $a_i(t)$ evolve in time according to the system of differential equations (d.e.):

$$\dot{a}_i(t) = -\frac{\partial}{\partial v_i} L(v_1(t), \ldots, v_N(t)), \qquad t \ge 0, \quad i = 1, \ldots, N. \qquad (4.1)$$

$L$ is called a Lyapunov function for the system (4.1) which means that the loss function $K(t) := L(v_1(t), \ldots, v_N(t))$ is decreasing in $t \ge 0$ along any solution of (4.1); this follows easily from $\dot{K}(t) = -\sum_{i=1}^{N} \sigma'(a_i(t)) \cdot (\partial L / \partial v_i)^2 \le 0$.

Therefore the limits $a_i^* := \lim_{t \to \infty} a_i(t)$ exist and the corresponding 'outputs' $v_i^* := \sigma(a_i^*)$ yield a (local or global) minimum $L(v_1^*, \ldots, v_N^*)$ of $L$. (Note that we did not mention a 'neural net' in this description.)
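The dynamics described above can be imitated numerically by a simple Euler discretization of the differential equations. The sketch below is only an illustration under stated assumptions: the step size, the number of steps, the function name `hopfield_descent` and the toy loss (a sum of squared deviations from fixed targets) are all chosen for the example and do not come from the text.

```python
import math


def sigma(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))


def hopfield_descent(loss, grad, a0, step=0.1, n_steps=2000):
    """Euler steps for da_i/dt = -dL/dv_i with outputs v_i = sigma(a_i)."""
    a = list(a0)
    losses = []
    for _ in range(n_steps):
        v = [sigma(ai) for ai in a]
        losses.append(loss(v))
        g = grad(v)
        a = [ai - step * gi for ai, gi in zip(a, g)]
    return [sigma(ai) for ai in a], losses


# Toy loss (an assumption for illustration): squared deviation from targets.
targets = [0.2, 0.7]
loss = lambda v: sum((vi - ti) ** 2 for vi, ti in zip(v, targets))
grad = lambda v: [2.0 * (vi - ti) for vi, ti in zip(v, targets)]

v_star, losses = hopfield_descent(loss, grad, [0.0, 0.0])
```

For a sufficiently small step the recorded losses decrease along the trajectory, which is the discrete counterpart of the Lyapunov property; with a large step this monotonicity is no longer guaranteed.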

This classical optimization method can be applied to any clustering problem if we re-formulate it as a minimization problem $L \to \min$, solve the corresponding system of differential equations (4.1) for $a_i(t)$ and determine the steady states $a_i^*$ and $v_i^*$ which are then re-interpreted in terms of an optimal clustering of data.

We exemplify this approach by two special cases taken from Adorf & Murtagh (1988) and Kamgar-Parsi et al. (1990). Both papers look for a fuzzy classification $\mathcal{U} = (u_{ik})_{m \times n}$ of $n$ objects where $u_{ik} \in [0, 1]$ is the degree of membership of the $k$-th object in the $i$-th fuzzy class $U_i$, which leaves $N = m \cdot n$ variables $u_{ik} = v_{ik} = \sigma(a_{ik})$ to be determined. Following the classical fuzzy clustering approach (Bock 1979, Bezdek 1981), Kamgar-Parsi et al. minimize a suitably modified SSQ criterion (4.2):

$$g_1(\mathcal{U}) := \sum_{i=1}^{m} \sum_{k=1}^{n} u_{ik}^2 \, \|x_k - \bar{x}_{U_i}\|^2 = \sum_{i=1}^{m} \sum_{k=1}^{n} u_{ik}^2 \, \Delta_{ki}(\mathcal{U}) \qquad (4.2)$$

with class centers $\bar{x}_{U_i} := [\sum_{k=1}^{n} u_{ik}^2 x_k] / [\sum_{k=1}^{n} u_{ik}^2]$ and deviations $\Delta_{ki}(\mathcal{U}) := \|x_k - \bar{x}_{U_i}\|^2$. They add some weighted penalty terms:

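• $g_2(\mathcal{U}) := \sum_{k=1}^{n} [\sum_{i=1}^{m} u_{ik} - 1]^2$ for forcing the norming conditions $\sum_{i=1}^{m} u_{ik} = 1$,

• $g_3(\mathcal{U}) := \sum_{k=1}^{n} [\sum_{i \neq j} u_{ik} u_{jk}]$ for reducing the overlap between classes,

• $g_4(\mathcal{U}) := \sum_{i=1}^{m} \sum_{k=1}^{n} \int_0^{u_{ik}} \sigma^{-1}(y)\, dy$ for forcing the $u_{ik}$ to the boundary of $[0, 1]$.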

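This yields the clustering criterion:

$$g(\mathcal{U}) := \frac{\alpha}{2} \cdot g_1(\mathcal{U}) + \frac{\beta}{2} \cdot g_2(\mathcal{U}) + \frac{\gamma}{2} \cdot g_3(\mathcal{U}) + g_4(\mathcal{U}) \to \min \qquad (4.3)$$

with positive weights $\alpha, \beta, \gamma$ and the following system of d.e. (4.1):

$$\dot{a}_{ik}(t) = -a_{ik}(t) - \alpha \cdot v_{ik}(t) \cdot \Delta_{ki}(t) - \beta \cdot \Big[ \sum_{j=1}^{m} v_{jk}(t) - 1 \Big] - \gamma \cdot \sum_{j \neq i} v_{jk}(t) - \frac{\alpha}{2} \sum_{l=1}^{n} v_{il}^2(t) \, \frac{\partial \Delta_{li}}{\partial u_{ik}}. \qquad (4.4)$$

It is solved by methods from numerical analysis, simulation or an analog device, possibly in discrete time. The limits for $t \to \infty$ yield a (sub-)optimum fuzzy classification $\mathcal{U}^* = (u_{ik}^*)$. Note that Kamgar-Parsi et al. cancelled the last term in the system (4.4) (with $\partial \Delta_{li} / \partial u_{ik} = -4 u_{ik} (x_l - \bar{x}_{U_i})'(x_k - \bar{x}_{U_i}) / (\sum_{r=1}^{n} u_{ir}^2)$) in order to get a linear system of d.e.'s. Moreover, they used a rescaled logistic function $u_{ik} = \tilde{\sigma}(a_{ik}) := \sigma((a_{ik} - \vartheta_{ik})/\tau)$ with various choices for the thresholds $\vartheta_{ik}$ and the scale $\tau > 0$ (where $\tau \to 0$ yields 'harder' classifications).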

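In contrast, Adorf & Murtagh (1988) started from an $n \times n$ matrix $D = (d_{kl})$ of pairwise dissimilarities $d_{kl}$ for $n$ objects and looked for an optimum fuzzy m-partition $\mathcal{U}$ in the sense:

$$\frac{\alpha}{2} \sum_{i=1}^{m} \sum_{k=1}^{n} \sum_{l=1}^{n} d_{kl} u_{ik} u_{il} - \frac{\beta}{2} \sum_{i \neq j} \sum_{k=1}^{n} \sum_{l=1}^{n} d_{kl} u_{ik} u_{jl} + \frac{\gamma}{2} \sum_{i \neq j} \sum_{k=1}^{n} b_k u_{ik} u_{jk}$$

$$= -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{n} \sum_{l=1}^{n} w_{ik,jl} \, u_{ik} u_{jl} \to \min. \qquad (4.5)$$

Here the first term supports a small heterogeneity in the classes, the second one a high separation among the classes, and the third term little overlap for all objects $k$ (with constant weights $\alpha, \beta, \gamma$ and $b_k$ (typically $b_k = b$)). This is a quadratic function in $\mathcal{U}$ with coefficients $w_{ik,jl} := -\alpha \cdot d_{kl} \delta_{ij} + \beta \cdot d_{kl} (1 - \delta_{ij}) - \gamma b_k \cdot (1 - \delta_{ij}) \delta_{kl}$ (and $\delta_{kl} \in \{0, 1\}$ the Kronecker symbol) which yields a linear system of d.e.'s:

$$\dot{a}_{ik}(t) = \sum_{j=1}^{m} \sum_{l=1}^{n} w_{ik,jl} \, u_{jl}(t), \qquad t \ge 0, \qquad (4.6)$$

for $i = 1, \ldots, m$, $k = 1, \ldots, n$. As before, the limiting values for $t \to \infty$ result in a stationary or optimal fuzzy classification $\mathcal{U}^* = (u_{ik}^*) = (\sigma(a_{ik}^*))$.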

Clustering with Hopfield networks closely resembles the classical gravitational clustering methods proposed by Butler (1969), Coleman (1971) and Wright (1977) for $n$ data vectors $x_1, \ldots, x_n \in \mathbb{R}^p$. Essentially, these methods consider $N := m$ class centers $z_i = z_i(t) \in \mathbb{R}^p$ as particles $i$ moving in (discrete) time $t \ge 0$ from some arbitrary initial states $z_i(0) \in \mathbb{R}^p$ along trajectories which are controlled by the 'gravitational' forces exerted by the data points $x_1, \ldots, x_n$ and derived from a suitable 'potential energy function' $L$. Centers which approach each other too closely are aggregated. Typically, the centers converge, for $t \to \infty$, to some few locations $z_1^*, z_2^*, \ldots \in \mathbb{R}^p$ which characterize the regions where the data $x_1, \ldots, x_n$ are most concentrated (modes). These locations can thus be considered as 'representatives' of classes $C_i$ which collect all data points which are close to the same center $z_i^*$.

5. Multilayer Perceptrons for Additive Clustering Models 

A Multilayer Perceptron (MLP) is built from several layers of elementary processing units $i$ ('neurons') where each one realizes a nonlinear function of the type $v = \sigma(\sum_j w_j u_j)$ of input values $u_1, u_2, \ldots$ with a sigmoid function $\sigma$. An MLP is designed to solve optimization or approximation problems by an optimal choice of the (numerous) 'weights' $w_j$ in the system (e.g., by using the backpropagation algorithm). Therefore an MLP can solve clustering problems provided they are formulated as a minimization problem (with continuous weight variables). As a typical example, we sketch the approach of Sato & Sato (1995, 1998) who start from an $n \times n$ matrix $S = (s_{kl})_{n \times n}$ of pairwise similarities $s_{kl} \ge 0$ and look for an optimal fuzzy classification $\mathcal{U} = (u_{ik})_{m \times n}$.

By generalizing the classical additive clustering model (ADCLUS) of Shepard & Arabie (1979), they assume that each pair of fuzzy classes $U_i, U_j$ has an unknown underlying similarity $w_{ij}$ such that the observed similarities $s_{kl}$ are modeled, up to random errors, by the additive pseudo-similarity $\hat{s}_{kl} = \sum_{i=1}^{m} \sum_{j=1}^{m} w_{ij} \, \rho(u_{ik}, u_{jl})$ for any pair $k, l$ of objects. This model uses a prespecified aggregation function $\rho(u_{ik}, u_{jl})$ that weights the simultaneous appearance of the events '$k \in U_i$ with a degree $u_{ik}$' and '$l \in U_j$ with a degree $u_{jl}$' (e.g., $\rho(u, v) = u \cdot v$, $\min\{u, v\}$, $uv/[1 - (1-u)(1-v)]$ etc.). The model $\hat{s}_{kl}$ is adapted to the data $s_{kl}$ by minimizing the clustering criterion:

$$G(\mathcal{U}, W) := \sum_{k=1}^{n} \sum_{l=1}^{n} (s_{kl} - \hat{s}_{kl})^2 \to \min \qquad (5.1)$$

with respect to all fuzzy partitions $\mathcal{U}$. This problem has been solved by Sato & Sato with the help of an MLP. Since the symmetry of $s_{kl}$ or $w_{ij}$ is not used in (5.1), their method applies to non-symmetric data as well, e.g., if $s_{kl}$ is the amount of telephone traffic from one town $k$ to another one $l$, with a similar interpretation for $w_{ij}$ on the cluster level. In a similar way, other additive least-squares decompositions of data could be handled (e.g., overlapping clusterings or MUMCLUS models as in Carroll & Chaturvedi 1998).
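To make the additive model more concrete, the sketch below evaluates a pseudo-similarity matrix and a least-squares fit for given memberships and class similarities. It assumes that the pseudo-similarity is the double sum of class similarities weighted by the aggregation function, and that the criterion is a plain, unnormalized sum of squared residuals; both forms, as well as the function names, are assumptions of this example and may differ in detail from the criterion actually used by Sato & Sato.

```python
def pseudo_similarity(U, W, rho=lambda u, v: u * v):
    """Model similarities for all object pairs (k, l).

    U[i][k] is the membership of object k in fuzzy class i,
    W[i][j] the similarity between classes i and j.
    """
    m, n = len(U), len(U[0])
    return [
        [sum(W[i][j] * rho(U[i][k], U[j][l])
             for i in range(m) for j in range(m))
         for l in range(n)]
        for k in range(n)
    ]


def least_squares_criterion(S, U, W, rho=lambda u, v: u * v):
    """Sum of squared residuals between observed and model similarities."""
    S_hat = pseudo_similarity(U, W, rho)
    n = len(S)
    return sum((S[k][l] - S_hat[k][l]) ** 2
               for k in range(n) for l in range(n))


U = [[1.0, 0.0], [0.0, 1.0]]      # crisp memberships: object k belongs to class k
W = [[1.0, 0.2], [0.3, 1.0]]      # deliberately non-symmetric class similarities
S_hat = pseudo_similarity(U, W)
fit_exact = least_squares_criterion(W, U, W)
fit_identity = least_squares_criterion([[1.0, 0.0], [0.0, 1.0]], U, W)
```

With crisp memberships and the product aggregation the model similarities reduce to the class similarities themselves, so a non-symmetric `W` produces a non-symmetric model matrix, in line with the remark on non-symmetric data.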

6. Other Neural Network Approaches 

The previous list of neural network clustering approaches is far from exhaustive. As illustrated by the previous cases, many well-known classical clustering (minimization) problems could be reformulated in the NN framework even if, by definition, an NN can scarcely yield 'better' clusterings than classical minimization methods. The borderline between classical and NN approaches is a fuzzy one, as exemplified by the adaptive resonance theory (ART) (Grossberg 1987) where several clustering approaches combine aspects from classical ISODATA with neural network terminology (Postma & Hudson 1995). Firmin & Hamad (1996) consider the usual mixture approach in the NN context and Kovalenko (1996) designs an NN method for detecting high-density clusters from density estimates.

Abstract: Kohonen's self-organizing maps (SOM) belong to the group of artificial neural network methods that are most frequently applied to data analysis. The most common applications of SOM are the visualization of multidimensional data and the clustering of huge data sets. Some characteristics of SOM also make this method interesting in other aspects of data analysis. This paper presents the possibility of applying SOM to outlier identification and missing data estimation.

1. Self-Organizing Maps Method

SOM is an unsupervised artificial neural network, and is frequently used in many 
areas of research and in practice (3014 Studies ..., 1997). Usefulness of SOM is 
proved by its: 

• Uniqueness, i.e. lack of direct similarity to other methods of data analysis 
(Sarle, 1994); 

• Robustness; 

• Rapid training; 

• Possibility of analyzing huge data sets; 

• Possibility of analyzing corrupted data (incomplete data vector) 
(Kohonen, 1995); 

• Biological analogy. 

The main area of SOM applications is data analysis in which no additional information about the data set (e.g. number of groups, group membership etc.) is provided. From this point of view, in many papers, SOM is presented as a visualization method (Ultsch, 1993) and a clustering method (Murtagh, 1995; Murtagh & Hernandez-Pajares, 1995a, 1995b). SOM, as a clustering algorithm, is used especially in the case of huge data sets.

In SOM all neurons (called nodes) are arranged in a one- or two-dimensional array; other dimensionalities are also possible.

In its basic form, the SOM algorithm consists of two phases which are repeated a number of times for every input object; each such step is called a presentation (Kangas, 1994; Kohonen, 1995; Kohonen et al., 1995). If the number of objects is small, e.g. less than 1000, objects are presented to the map repeatedly, so that the total number of presentations is in the range of 10,000 - 100,000 (Kohonen, 1995).

Phase 1. Finding the matching node. 

For every pattern $x(t)$ from the input space a matching node $w_m$ on the map should be found. In the majority of cases the Euclidean distance is used to find the matching node. Sometimes the dot product $x^T w$ is used, but in this case normalization of the input vectors is necessary.

$$\|x(t) - w_m(t)\| = \min_j \{ \|x(t) - w_j(t)\| \} \qquad (1)$$

where $m$ is the index of the matching node, $t$ the index of the input object presentation, and $j$ the node index.

Phase 2. The adaptation of the matching node and its neighbors. 

The weight vectors of the nodes are adapted. The goal of this phase is to lower the distance between the presented object $x(t)$ and its matching node ($w_m$) together with its neighbors. If the distance measure is the Euclidean distance, the adaptation formula is as follows:

$$w_{ji}(t + 1) = w_{ji}(t) + \eta(t)\, h_{mj}(t)\, [x_i(t) - w_{ji}(t)] \qquad (2)$$

where $w_{ji}(t)$ is the $i$-th component of the $j$-th node weight vector in the $t$-th learning step; $\eta(t)$ is the learning parameter, $0 < \eta(t) < 1$ (decreasing with $t$); $h_{mj}(t)$ is the neighborhood function, $0 \le h_{mj}(t) \le 1$; $m$ is the index of the matching node, $j$ the index of the updated node, and $t$ the learning step.



The unique mechanism that distinguishes SOM from other competitive-learning 
networks is the neighborhood function - The main task of the learning 

function is to provide learning (weights updating) not only for best matching node 
(winning neuron), but also for its neighbors. Commonly used types of 
neighborhood functions are as follows: 



$h_{mj}(t) = \begin{cases} 1, & \|r_m - r_j\| \le \sigma(t) \\ 0, & \text{otherwise} \end{cases}$   (3)

where $r_m$ denotes the coordinates of the matching node, $r_j$ the coordinates
of the updated node, and $\sigma(t)$ the neighborhood radius, a decreasing
function of $t$.

$h_{mj}(t) = \exp\left(-\frac{\|r_m - r_j\|^2}{2\sigma^2(t)}\right)$   (4)
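The two learning phases can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, the map is stored as plain lists, and the Gaussian neighborhood is assumed to have the common form with $2\sigma^2(t)$ in the denominator.

```python
import math

def best_matching_node(x, weights):
    """Phase 1: index m of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(x, weights[j]))

def gaussian_neighborhood(r_m, r_j, sigma):
    """Gaussian neighborhood value in (0, 1]; the 2*sigma^2 scaling is an assumption."""
    return math.exp(-math.dist(r_m, r_j) ** 2 / (2.0 * sigma ** 2))

def som_step(x, weights, coords, eta, sigma):
    """One presentation of x: find the matching node, then adapt every node in place."""
    m = best_matching_node(x, weights)
    for j, w in enumerate(weights):
        h = gaussian_neighborhood(coords[m], coords[j], sigma)
        # Phase 2: move each weight component toward x, scaled by eta * h.
        weights[j] = [w_i + eta * h * (x_i - w_i) for w_i, x_i in zip(w, x)]
    return m

# Usage: a one-dimensional map with three nodes in a two-dimensional input space.
weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
coords = [(0,), (1,), (2,)]
winner = som_step([0.9, 1.2], weights, coords, eta=0.5, sigma=1.0)
```

In a full training run, `eta` and `sigma` would both be decreased as the number of presentations grows.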



SOM can be perceived as: 

• Nonlinear transformation of any multidimensional continuous space into one- 
or two-dimensional discrete space, which preserves order of objects from 
input space. In this sense SOM may be seen as similar to principal component 
analysis (Sarle 1994; Oja et al., 1995), multidimensional scaling 
(Jajuga, 1990, 1993) and clustering methods; 

• Nonparametric, nonlinear regression (Cherkassky & Lari-Najafi, 1991; 
Kohonen 1995) which attempts to fit a number of codebook vectors to the 
distribution of objects in the input space. 



The interested reader may find a detailed description of the SOM algorithm and
related topics in (Kohonen, 1995; Kohonen et al. 1995).



2. Outlier Identification 

Many data analysis methods assume homogeneity of the data set, although in
practice this assumption is often not fulfilled. Terms such as elliptical
homogeneity and homogeneity in the sense of linear regression are defined in
(Jajuga, 1993), where the author points out the importance of robust analytical
methods and the necessity of outlier identification.

SOM can be helpful in dealing with the issue of data set homogeneity. As a 
visual-grouping method, SOM may be used in data set structure analysis. 
Additionally its nonparametric nonlinear regression capabilities may be used in 
visualizing data set homogeneity and identification of non-typical data, especially 
in case of multi-modal data sets. 

Some data objects differ markedly from the rest of the data. In order to
distinguish between typical and non-typical data, it is necessary to define a
criterion of difference. This criterion can be defined as the distance (e.g. the
Euclidean distance) from a data object to the linear regression hyper-surface.
An object located far from the hyper-surface can be perceived as non-typical.

However, in multi-modal data sets the criterion of distance to the linear
regression hyper-surface cannot be accepted, because the linear regression
hyper-surface does not match the modal structure of the data set. In this case a
nonlinear, nonparametric regression method is necessary, and SOM, being one such
method, can serve the purpose.

As stated earlier in this paper, SOM can be perceived as a nonlinear
nonparametric regression which is formed (in the learning process) as an elastic
lattice. Non-typical data, because of their uniqueness, are unable to form their
“own” nodes (codebook vectors); this makes them distinct from the lattice. The
measure of non-typicality is referred to as the quantization error in the SOM
literature. The quantization error connected with the i-th object of the input
space is defined as follows:

$QE_i = \min_j \{\|x_i - w_j\|\}$   (5)

where $x_i$ is the i-th input object, $i = 1, \ldots, S$, and $w_j$ is the j-th
codebook vector, $j = 1, \ldots, M$. The SOM algorithm seeks a map for which the
average quantization error

$AQE = \frac{1}{S} \sum_{i=1}^{S} QE_i$   (6)

reaches its minimum.

Such functions as Euclidean distance, Mahalanobis distance and others may be 
used as a distance measure. 

The method for outlier identification can be summarized in the following steps:

1. Definition of a SOM with a satisfactory average quantization error, done in
the map design and learning process;

2. Processing of the data by the defined SOM;

3. Identification of outliers according to the magnitude of the quantization
error;

4. Visualization of data non-typicality by plotting the quantization error
associated with every datum in a coordinate system.

An issue that demands further study is how to find the value of the quantization
error that would allow a datum to be classified as typical or non-typical.
Obviously it is necessary to analyze the distribution of the quantization error.
If this distribution were, for example, normal, the demarcation criterion could
be the three-sigma rule.
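The steps above, together with a three-sigma threshold, can be sketched as follows. The sketch assumes an already trained codebook and a single error distribution for the whole data set; the function names and the threshold `mean + k * sd` are illustrative assumptions, not a procedure taken from the cited literature.

```python
import math
import statistics

def quantization_error(x, codebook):
    """Distance from x to its nearest codebook vector (Euclidean)."""
    return min(math.dist(x, w) for w in codebook)

def flag_outliers(data, codebook, k=3.0):
    """Return (errors, average error, flags); a datum is flagged if QE > mean + k * sd."""
    errors = [quantization_error(x, codebook) for x in data]
    aqe = statistics.fmean(errors)          # average quantization error
    sd = statistics.pstdev(errors)          # population standard deviation
    threshold = aqe + k * sd
    return errors, aqe, [qe > threshold for qe in errors]

# Usage: two tight groups sitting on their codebook vectors plus one stray object.
codebook = [[0.0, 0.0], [10.0, 10.0]]
data = [[0.0, 0.0]] * 10 + [[10.0, 10.0]] * 9 + [[5.0, 5.0]]
errors, aqe, flags = flag_outliers(data, codebook)
```

Plotting `errors` against the object index gives the visualization mentioned in step 4; a per-group threshold would require estimating `aqe` and `sd` separately for the objects mapped to each group.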

Taking into account the multimodality of the analyzed data sets, it seems that two 
cases are possible: 

1. Assume that the distribution is based on all quantization errors, regardless
of the group the quantization error comes from. In this case it is necessary to
find the parameters of one distribution;

2. Assume that every group has its own quantization error distribution (in the
sense of parameters). In this case it would be necessary to estimate the
parameters for each group in the data set.

3. Estimation of Missing Data 

An important problem in data analysis, addressed by many authors, is the
incompleteness of data (Ghahramani & Jordan 1994; Galuszka 1994, 1995; Kordos,
1988), and a fully satisfactory solution to this problem probably does not
exist. In certain disciplines, e.g. technical ones, missing data can be obtained
by repeating the experiment; in others, e.g. socioeconomic ones, this is not
possible. In this case an analysis of incomplete data is necessary.

When considering missing data we usually think of multidimensional vectors in
which certain positions are empty. The following alternatives exist for dealing
with incomplete data:

• Elimination of vectors or columns containing missing data from the data 
matrix; 

• Estimation of missing data (using different methods); 

• Application of robust data analysis methods. 

SOM is a robust nonparametric data analysis method and can be used to estimate
missing data. To fulfill this task, certain modifications have to be made to the
classical SOM algorithm. These modifications are described in (Kohonen, 1995)
and implemented in the SOM_PAK software (Kohonen et al., 1995):

• In the first phase of the SOM algorithm, the distance between a codebook
vector and the analyzed object is calculated excluding the missing data
components. This means that the number of missing positions reduces the
dimension of the vector.

• In the second phase of the SOM algorithm, only those components of the
codebook vectors are updated for which the corresponding components of the
input object are not missing.

This modified method was used by Kohonen to visualize a world poverty map
(Kohonen, 1995).

The algorithm mentioned above provides not only visualization but also
estimation of all missing values. This is possible even when all data vectors
are corrupted. The estimated values are given by those components of the output
node weight vectors which correspond to the empty places in the input objects.
Naturally, SOM for missing values can be applied only when the missing data are
not significantly different from the existing data, which are the basis for the
missing data estimation.
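The two modifications and the resulting estimate can be sketched as follows. This is an illustration only, not the SOM_PAK code: missing components are marked with `None`, all function names are hypothetical, and `rate` stands for the product of the learning parameter and the neighborhood function.

```python
import math

def masked_distance(x, w):
    """Euclidean distance computed only over the observed components of x."""
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w) if xi is not None))

def matching_node(x, codebook):
    """Phase 1 with missing data: nearest codebook vector under the masked distance."""
    return min(range(len(codebook)), key=lambda j: masked_distance(x, codebook[j]))

def masked_update(x, w, rate):
    """Phase 2 with missing data: adapt only components whose input value is observed."""
    return [wi if xi is None else wi + rate * (xi - wi) for xi, wi in zip(x, w)]

def impute(x, codebook):
    """Fill the missing components of x from the matching node's weight vector."""
    w = codebook[matching_node(x, codebook)]
    return [wi if xi is None else xi for xi, wi in zip(x, w)]

# Usage with two hypothetical codebook vectors in a four-dimensional input space.
codebook = [[5.0, 3.4, 1.5, 0.2], [6.5, 3.0, 5.5, 2.0]]
completed = impute([6.4, None, 5.3, None], codebook)
```

Because the estimate is read from a single codebook vector, its quality depends on how well the trained map represents the region of the input space the incomplete object belongs to.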

In order to test the ability of SOM to estimate missing values, experiments were
carried out on the Iris data file. Iris is a data set containing 150
observations of Iris flower characteristics, first used by Fisher in 1936. The
measured features are the following: sepal length, sepal width, petal length and
petal width. The observations come from three Iris species: Iris setosa
(observations 1 - 50), Iris versicolor (observations 51 - 100) and Iris
virginica (observations 101 - 150). This data set is used as a benchmark file
for testing data analysis procedures.

The author has decided not to present Fisher's Iris file in detail, given its
popularity. The interested reader may find it in many publications, for example
in (Jajuga, 1990).

The experiments were conducted using the SOM_PAK software (Kohonen et al.,
1995).

The test files were prepared by randomly eliminating certain vector components.
Five data files were used in the experiments: the first was Fisher's original
file with no corrupted data, the second contained 90% of the existing data, the
third 80%, the fourth 70% and the fifth 60%. SOM was compared with the most
commonly used method of missing data estimation, substitution by the column
mean.

The following measures of error were used:

Absolute error:

$E = |x - \hat{x}|$   (7)

where $x$ is the pattern (true) value and $\hat{x}$ is the estimated value.

Relative error:

$E_{rel} = \frac{|x - \hat{x}|}{x}$   (8)

Because the data objects in the Iris data set are 4-component vectors, the above
error measures were aggregated over the components $k = 1, \ldots, 4$ in the
following way:

$E = \sqrt{\sum_{k=1}^{4} (x_k - \hat{x}_k)^2}$   (9)

$E_{rel} = \sqrt{\sum_{k=1}^{4} \left(\frac{x_k - \hat{x}_k}{x_k}\right)^2}$   (10)

The above formulas can be interpreted as the Euclidean norms of the absolute and
relative error vectors.
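For comparison purposes, the baseline (substitution by the column mean) and the aggregated error measures can be sketched as below. The sketch assumes that the aggregated errors are the Euclidean norms of the component-wise absolute and relative errors, as the interpretation above suggests; the function names are hypothetical and the true values are assumed to be non-zero, which holds for the Iris measurements.

```python
import math

def column_means(rows):
    """Mean of each column, ignoring missing entries (None)."""
    means = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return means

def substitute_means(rows):
    """Replace every missing entry by the mean of its column."""
    means = column_means(rows)
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]

def aggregated_errors(true_row, estimated_row):
    """Euclidean norms of the absolute and the relative error vector of one object."""
    abs_err = [abs(t - e) for t, e in zip(true_row, estimated_row)]
    rel_err = [a / abs(t) for a, t in zip(abs_err, true_row)]   # requires t != 0
    return math.hypot(*abs_err), math.hypot(*rel_err)

# Usage on a tiny two-column example with one missing value.
filled = substitute_means([[1.0, None], [3.0, 4.0], [5.0, 8.0]])
e_abs, e_rel = aggregated_errors([1.0, 3.0], filled[0])
```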

The results of the experiments showed that the aggregated relative estimation
error (10) for the SOM method, tested with different variants of neighborhood
and topology parameters, was approximately 4-5 times lower than for the
substitution-by-mean method.

These results support the usefulness of SOM in the given application.



4. Conclusion 

SOM, besides its common applications (visualization and clustering), may be used
in other non-trivial data analysis tasks: outlier identification and estimation
of missing data. This is possible due to SOM's robustness and its nonparametric
nonlinear regression characteristics. These features of the SOM algorithm
require further in-depth study.



Abstract. Neuro-fuzzy classification systems offer means to obtain 
fuzzy classification rules by a learning algorithm. It is usually possible 
to find a suitable fuzzy classifier by learning from data, but it can 
be hard to obtain a classifier that can be interpreted conveniently. 
However, the main reason for using fuzzy methods for classification is 
usually to obtain an interpretable classifier. In this paper we discuss 
the learning algorithms of NEFCLASS, a neuro-fuzzy approach for 
data analysis. 



Keywords: fuzzy classification, neuro-fuzzy system 



1 Introduction 

Neuro-fuzzy systems are approaches to learning fuzzy systems from data by 
using learning algorithms derived from neural network theory. The learning 
capabilities of neural networks made them a prime target for a combination 
with fuzzy systems in order to automate or support the process of developing 
a fuzzy system for a given task. The first so-called neuro-fuzzy approaches 
were considered mainly in the domain of (neuro-) fuzzy control, but today 
the approach is more general. Neuro-fuzzy systems are applied in various 
domains, e.g. control, data analysis, decision support, etc. 

Modern neuro-fuzzy systems are usually represented as a multilayer feedforward
neural network [Berenji and Khedkar, 1992; Buckley and Hayashi, 1995; Halgamuge
and Glesner, 1994; Nauck et al., 1997; Tschichold-Gürman, 1997]. In neuro-fuzzy
models, connection weights and propagation and activation functions differ from
common neural networks. Although there are a lot of different approaches we want
to restrict the term “neuro-fuzzy” to systems which display the following
properties:

Acknowledgement: The research presented in this paper is partly funded by DFG 
contract KR 521/3-1 

(i) A neuro-fuzzy system is a fuzzy system that is trained by a learning 
algorithm (usually) derived from neural network theory. The (heuristic) 
learning procedure operates on local information, and causes 
only local modifications in the underlying fuzzy system. The learning 
process is not knowledge based, but data driven. 

(ii) A neuro-fuzzy system can be viewed as a special 3-layer feedforward 
neural network. The units in this network use t-norms or t-conorms in- 
stead of the activation functions common in neural networks. The first 
layer represents input variables, the middle (hidden) layer represents 
fuzzy rules and the third layer represents output variables. Fuzzy sets 
are encoded as (fuzzy) connection weights. This view of a fuzzy sys- 
tem illustrates the data flow within the system, and its parallel nature. 
However this neural network view is not a prerequisite for applying a 
learning procedure, it is merely a convenience. 

(iii) A neuro-fuzzy system can always (i.e. before, during and after learning) 
be interpreted as a system of fuzzy rules. It is both possible to create 
the system out of training data from scratch, and it is possible to 
initialize it by prior knowledge in form of fuzzy rules. 

(iv) The learning procedure of a neuro-fuzzy system takes the semantical 
properties of the underlying fuzzy system into account. This results 
in constraints on the possible modifications applicable to the system 
parameters. 

(v) A neuro-fuzzy system approximates an n-dimensional (unknown) func- 
tion that is partially given by the training data. The fuzzy rules en- 
coded within the system represent vague samples, and can be viewed 
as vague prototypes of the training data. A neuro-fuzzy system should 
not be seen as a kind of (fuzzy) expert system, and it has nothing to 
do with fuzzy logic in the narrow sense [Kruse et al., 1994]. 

In this paper “neuro-fuzzy” has to be understood in the way given by 
the five points above. Therefore we consider “neuro-fuzzy” as a certain 
technique to derive a fuzzy system from data, or to enhance it by learning 
from examples. The exact implementation or “neuro-fuzzy model” does not 
matter. 

In this paper we discuss neuro-fuzzy classification and use our NEFCLASS 
model as an example. In the following section we describe the model 
and its learning capabilities. After that we give an example where we apply 
NEFCLASS to a real-world data set. 



2 Classification with NEFCLASS 

Classification of data is an area of application where statistical methods, 
machine learning and neural networks are thoroughly examined and suc- 
cessfully used. It is also possible to use a fuzzy system for classification with 
rules like 



if $x_1$ is $\mu_1$ and $x_2$ is $\mu_2$ and ... and $x_n$ is $\mu_n$
then pattern $(x_1, x_2, \ldots, x_n)$ belongs to class $i$,

where the $\mu_1, \ldots, \mu_n$ are fuzzy sets.

What is the advantage of having another method for classification? A 
fuzzy classifier is not a replacement for the aforementioned methods yielding 
better results, but a different way of achieving the same goal. If a decision is 
made for a fuzzy classifier usually the following advantages are considered: 

• vague knowledge can be used, 

• the classifier is interpretable in form of linguistic rules, 

• from an applicational view the classifier is easy to implement, to use 
and to understand. 

The rule base of a fuzzy classifier that uses rules like those mentioned above
represents an approximation of an (unknown) function
$\varphi : \mathbb{R}^n \to \{0,1\}^m$ that represents the classification task,
where $\varphi(x) = (c_1, \ldots, c_m)$ such that $c_i = 1$ and $c_j = 0$
($j \in \{1, \ldots, m\}$, $j \ne i$), i.e. $x$ belongs to class $C_i$. Because
of the inference process, the rule base actually does not approximate $\varphi$
but the function $\varphi' : \mathbb{R}^n \to [0,1]^m$. We can obtain
$\varphi(x)$ by $\varphi(x) = \psi(\varphi'(x))$, where $\psi$ reflects the
interpretation of the classification result obtained from the fuzzy classifier.
Usually the class with the largest activation value is chosen.
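The mapping from a pattern to class activations and then to a crisp class can be sketched as follows. This is an illustrative sketch under stated assumptions, not the NEFCLASS implementation: triangular membership functions, the minimum as t-norm for the rule antecedent, and the maximum to combine rules of the same class are all choices made here, and the names `triangular` and `classify` are hypothetical.

```python
def triangular(a, b, c):
    """Return a triangular membership function with support (a, c) and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

def classify(pattern, rules, n_classes):
    """rules: list of (tuple of membership functions, class index).

    Returns the winning class and the activation vector in [0, 1]^m.
    """
    activations = [0.0] * n_classes
    for fuzzy_sets, cls in rules:
        # Degree of fulfilment of the antecedent: minimum over all variables.
        degree = min(mu(x) for mu, x in zip(fuzzy_sets, pattern))
        # Assumption: rules of the same class are combined by the maximum.
        activations[cls] = max(activations[cls], degree)
    # Interpretation step: choose the class with the largest activation.
    winner = max(range(n_classes), key=activations.__getitem__)
    return winner, activations

# Usage: two variables, two rules, two classes.
small = triangular(-10.0, 0.0, 10.0)
large = triangular(0.0, 10.0, 20.0)
rules = [((small, small), 0), ((large, large), 1)]
winner, activations = classify((2.0, 3.0), rules, n_classes=2)
```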

Classifiers are usually derived from data and are not specified directly. 
In case of a fuzzy classifier there are two common methods: fuzzy clustering, 
and neuro-fuzzy learning. 

In case of fuzzy clustering the input space is searched for clusters. The 
number of clusters is determined by an evaluation measure, and the size 
and shape of the clusters is given by the clustering algorithm. The obtained 
clusters can function as a fuzzy classifier, in fact, there is no need to express 
the classifier by rules. However, in this case the interpretability is lost, and 
therefore fuzzy rules are sometimes created by projection of clusters. This 
can cause a loss of information, because a fuzzy rule created by projection 
does not represent the cluster exactly, but only the smallest encompassing 
hyperbox. The performance of a fuzzy classifier obtained by clustering is 
therefore usually reduced, once it is expressed in form of fuzzy rules. In 
addition the rules are often hard to interpret, because the resulting fuzzy 
sets can have almost any shape. 

Figure 1: A NEFCLASS system represented as a 3-layer neural network 

Another method to obtain a fuzzy classifier from data is to use a neuro- 
fuzzy approach. This means the classifier is created from data by a heuristic 
learning procedure. If the neuro-fuzzy approach meets the five points listed 
in Section 1, then interpretability of the resulting classifier and an 
acceptable performance might be obtained. A neuro-fuzzy approach is also 
often less computationally expensive than a clustering approach, because of 
its simplicity. 

A neuro-fuzzy classifier is simply a fuzzy classifier obtained by a learning 
procedure, as described in Section 1. There are several known approaches 
to find a fuzzy classifier this way, e.g. FuNE [Halgamuge and 
Glesner, 1994], Fuzzy RuleNet [Tschichold-Gürman, 1997], or NEFCLASS 
classifier there is usually no hint on how it was derived. Therefore the term 
“neuro-fuzzy” strictly only applies to the creation or training phase. Once 
the classifier is applied and not changed further, a simple fuzzy system for 
classification remains. However the term “neuro-fuzzy classifier” is usually 
kept, to stress the mode of obtaining the classifier. The interpretation of 
the classifier is also often illustrated by representing it in a neural network 
structure, like e.g. NEFCLASS in Fig. 1. 

We have presented our NEFCLASS model (neuro-fuzzy classification) in 
[Nauck and Kruse, 1997a], and some refined learning algorithms in [Nauck 
and Kruse, 1997b]. Details about the model can also be found in [Nauck 
et al., 1997]. In this paper we only want to discuss the general idea of 
the learning process. For the following descriptions consider the illustration 
given in Figure 2. 

Figure 2: Left: Situation after 3 fuzzy classification rules have been created 
using initial membership functions. Right: Situation after training the 
classifier, i.e. modifying the membership functions. 

The rule learning algorithm needs an initial fuzzy partitioning for each 
variable. This is e.g. given by a fixed number of equally distributed triangular 
membership functions (see left part of Fig. 2). The combination of the fuzzy 
sets forms a “grid” in the data space, i.e. equally distributed overlapping 
rectangular clusters. Then the training data is processed, and those clusters 
that cover areas where data is located are added as rules into the rule base 
of the classifier. In a next step this rule base is reduced by just keeping the 
best performing rules [Nauck and Kruse, 1997a]. The result after this stage 
of training can e.g. look like the situation on the left hand side of Fig. 2. 
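The rule creation step can be sketched as follows. This is a simplified illustration and not the NEFCLASS rule learning algorithm itself: each pattern selects, per variable, the fuzzy set with the highest membership degree, and the consequent of each antecedent found this way is taken to be the class with the largest accumulated activation. The selection of the best performing rules is omitted, and all names are hypothetical.

```python
def triangular_partition(lo, hi, k):
    """k equally spaced triangular fuzzy sets on [lo, hi] (k >= 2)."""
    step = (hi - lo) / (k - 1)
    peaks = [lo + i * step for i in range(k)]
    return [lambda x, p=p: max(0.0, 1.0 - abs(x - p) / step) for p in peaks]

def learn_rules(patterns, labels, partitions):
    """Return {antecedent: class}; an antecedent is a tuple of fuzzy-set indices."""
    support = {}
    for x, label in zip(patterns, labels):
        # Grid cell hit by the pattern: best fuzzy set for each variable.
        antecedent = tuple(
            max(range(len(sets)), key=lambda i: sets[i](v))
            for sets, v in zip(partitions, x)
        )
        degree = min(sets[i](v) for sets, i, v in zip(partitions, antecedent, x))
        per_class = support.setdefault(antecedent, {})
        per_class[label] = per_class.get(label, 0.0) + degree
    # Consequent: class with the largest accumulated activation for the antecedent.
    return {a: max(c, key=c.get) for a, c in support.items()}

# Usage: two variables on [0, 10], three fuzzy sets each (index 0, 1, 2).
partitions = [triangular_partition(0.0, 10.0, 3)] * 2
rules = learn_rules(
    [(1.0, 1.0), (2.0, 0.5), (9.0, 9.5)],
    ["benign", "benign", "malignant"],
    partitions,
)
```

Only grid cells that actually contain data produce rules, which keeps the rule base far smaller than the full grid.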

After the rulebase has been created, the membership functions are tuned 
by a simple heuristic. For each rule a classification error is determined, and 
used to modify the membership function that is responsible for the rule 
activation (i.e. delivers the minimal membership degree of all fuzzy sets in 
the rule's antecedent). The modification results in shifting the fuzzy set, and 
enlarging or reducing its support, such that a larger or smaller membership 
degree is obtained depending on the current error. The learning result might 
look like the situation on the right hand side of Fig. 2. 

To obtain an interpretable classifier some restrictions can be specified by 
the user. The NEFCLASS software [Nauck et al., 1997] allows the user to impose 
the following restrictions on the learning algorithm: 

• a membership function must not pass one of its neighbors, 

• a membership function may be asymmetrical, 

• membership functions must intersect at 0.5. 







After learning, pruning methods [Nauck and Kruse, 1997b] can be used 
to reduce the number of rules and variables in the neuro-fuzzy system. A 
complete description of the learning algorithm can be found in [Nauck et al., 
1997]. 

An example that describes the performance of the learning procedure is 
presented in the following section. 

3 Learning Classifiers with NEFCLASS 

As an example for the learning capabilities of NEFCLASS we use the “Wisconsin 
Breast Cancer” data set [Wolberg and Mangasarian, 1990]. This data 
set contains 699 cases distributed into two classes (benign and malignant). We 
used only 683 cases (342 for training, 341 for testing), because 16 cases have 
missing values. Each pattern has nine features. 

To show how NEFCLASS performs when prior knowledge is supplied, we 
used a fuzzy clustering method to obtain fuzzy rules. We used a modifica- 
tion of the algorithm by Gustafson and Kessel [Gustafson and Kessel, 1979] 
presented in [Klawonn and Kruse, 1997]. Fuzzy clustering discovered three 
clusters that were interpreted as fuzzy rules by projecting the clusters to 
each dimension and finding trapezoidal membership functions that closely 
matched the projections. The resulting three rules caused 94 classification 
errors on the whole data set. 

Now we matched the membership functions obtained by projection with 
linguistic labels small, medium and large. By this we got the following rule 
base (s = small, m = medium, l = large): 

R1: if (s,s,s,s,s,s,s,s,s) then benign, 

R2: if (m,m,m,m,m,l,m,m,s) then malignant, 

R3: if (m,l,l,l,m,l,m,l,s) then malignant. 

After we initialized a NEFCLASS system with these three rules, we at 
first got 240 classification errors. However, after 80 epochs of training, using 
the constraint, that fuzzy sets must not pass each other, we obtained a result 
of only 50 errors altogether (92.7% correct). 

A neuro-fuzzy learning strategy is a tool to support the creation of a fuzzy 
classifier, but not to completely automate it. This means the user should 
supervise the learning process, and interpret the result. By analyzing the 
fuzzy sets obtained by training the three rules used as prior knowledge we 
found that the fuzzy set medium substantially overlapped with either the 
fuzzy set small or large for almost all variables. This can be seen as evidence 
that medium is superfluous. 

Therefore we again trained a NEFCLASS system with “best per class” 
rule learning, allowing it to create four rules (two for each class). This time 
we only used two fuzzy sets to partition the domains of each variable. After 
100 epochs of training NEFCLASS made only 24 errors on the complete set 
(96.5% correct). It has found the rules 

R1: if [...] then benign, 

R2: if (l,s,s,s,s,s,s,s,s) then benign, 

R3: if [...] then malignant, 

R4: if [...] then malignant. 

If we additionally apply the constraint that the fuzzy sets must intersect 
at membership degree 0.5 we obtain 26 classification errors on the whole set. 

With the pruning techniques of NEFCLASS [Nauck and Kruse, 1997b] 
this rule base can be further improved. After pruning we obtained a classi- 
fication result of 6 errors on the training set and 20 errors on the test set. 
So we have again 26 classification errors, but now a much smaller rule base. 
The two remaining rules are: 

R1: If the variables 3, 4, 6, 7 and 8 are all large, then class is malignant, 
R2: If the variables 3, 4, 6, 7 and 8 are all small, then class is benign. 

The linguistic terms small and large for each variable are represented 
by membership functions which overlap with their neighbors at membership 
degree 0.5, and so the fuzzy sets are also nicely interpretable. 



4 Conclusions 

We have discussed neuro-fuzzy classification using our NEFCLASS model 
as an example. We consider a neuro-fuzzy method to be a tool for creating 
fuzzy systems from data. It supports the user, but it cannot do all the work. 
It is difficult to find a good and interpretable fuzzy classifier by a completely 
automated learning process. The learning algorithm must take the semantics 
of the desired fuzzy system into account, and adhere to certain constraints. 
The learning result must also be interpreted, and the insights gained by this 
must be used to restart the learning procedure to obtain better results if 
necessary. 

A fuzzy classifier, especially a neuro-fuzzy classifier, is only used, when 
interpretation, and the employment of (vague) prior knowledge is required. 
Fuzzy classification is not a replacement, but an addition to other methods 
like statistics or neural networks. The price for the interpretation of the 
classifier in form of simple fuzzy rules, and for the simple and computation- 
ally efficient learning algorithm might be paid by a classification result that 
is not as good as it could be if other methods are used. 

The example with the Wisconsin Breast Cancer data shows that NEFCLASS 
can be used as an interactive data analysis method. It is useful to 
provide prior knowledge whenever possible, e.g. by initializing NEFCLASS 
with rules found by fuzzy clustering. The learning result of NEFCLASS can 
be analyzed, and the obtained information can be used for another run that 
yields an even better result. The user is able to supervise and interpret the 
learning procedure in all its stages. The software tool and the data that we 
used to obtain the results presented in this paper can be found on 
our WWW server fuzzy.cs.uni-magdeburg.de. 



Abstract: In this paper we show how to avoid unnecessary calculations 
and to reduce considerably the computational cost in a wide class of 
tree-based methods. So called auxiliary statistics, which make it possible to 
restrict the handling of the raw data, are introduced. In addition, a fast 
splitting algorithm is outlined, which makes it possible to recognize and avoid 
unnecessary split evaluations during the search for an optimal split. 
Relationships between the computational cost savings and properties of both a 
specific method and the data are summarized. 

Key Words: Classification and regression trees, computational cost, aux- 
iliary statistics, fast splitting algorithm. 



1. Introduction 

The tree-growing methods of data analysis (see the survey by Mola in 
this volume) constitute a family whose members have, besides methodological 
differences, a number of common features. Each of these methods 
fits to the data a kind of “piecewise model” defined in terms of the 
set of the terminal nodes of a tree. The trees are grown recursively. Each 
branching is the result of extensive testing of a number of candidate 
partitions (splits). During the search for the optimal split, analogous 
calculations are typically performed many times with one and the same portion 
of the data. 

Our interest in the tree-based methods of data analysis, which dates 
back several years, was primarily methodological. However, experience 
with computer implementation has also drawn our attention to the 
questions of computational efficiency and computational enhancements. 

We believe that many interesting ideas for computational enhancements 
may have been implemented. Unfortunately, most of them are not described 
and documented (a rare exception can be found in Breiman et al., 
1984; see Section 4 of this paper). 

In this contribution, we would like to advertise the following principle: 
Do not compute what need not be computed! 

Section 2 recapitulates some basics of the tree-based methodologies. 
The main results are contained in Sections 3 and 4, where our proposals of 
computational enhancements are outlined. In Section 3 we deal 
with so called auxiliary statistics, a tool that makes it possible to restrict 
calculations from the raw data, while Section 4 is devoted to the most 
straightforward demonstration of the above principle, the fast splitting 
algorithm. This algorithm makes it possible to recognize that some computations 
are unnecessary and to avoid them. Computational cost savings resulting from 
the ideas outlined in Sections 3 and 4 will be discussed in Section 5, together 
with a concluding remark. 



2. Basics and notation 

One of the most important common building blocks of almost all of the 
tree-based methods is the task of finding the best split for a node of a tree 
over a set of candidate splits. Let us recall some basics and introduce 
necessary notation related to this procedure. 

Throughout the paper, we shall confine our interest to the standard data 
structures and splits following Breiman et al. (1984). At the same time 
we shall explicitly deal only with binary splits, though much of the ideas 
presented in the paper can be easily adapted for the methods generating 
ternary or, more generally, n-ary splits with n > 2. 

Each case $l$ in the data set (learning sample) is assumed to have its 
values of predictors (left upper index will be left 
out whenever there is just one predictor of interest at the moment). 

We shall use the symbol $\mathcal{L}$ with possible arguments and/or subscripts 
for sets of cases. The size of a set of cases $\mathcal{L}$ will be denoted 
$|\mathcal{L}|$. By $\mathcal{L}(t)$ we denote the set of those cases that 
belong to the node $t$ of the tree. The left and right “sons”, respectively, of 
$t$ will be denoted $t_L$ and $t_R$. Recall that $\mathcal{L}(t_L)$ and 
$\mathcal{L}(t_R)$ are disjoint, $\mathcal{L}(t)$ being their union. 

Let $X(t)$ be the set of all values that predictor $X$ takes on in node $t$, 
i.e. in $\mathcal{L}(t)$. Predictor $X$ defines for each node $t$ a set of 
splits $S(X,t)$, each split $s$ being given by a standard question. For a 
categorical predictor $X$, standard questions are of the form “Is the value of 
$X$ in $B$?”, where $B$ is any proper nonempty subset of $X(t)$. For a 
numerical predictor $X$, questions “$X \le c$?” are used for a finite set of 
cutpoints $c$, e.g., the midpoints between the consecutive ordered values from 
$X(t)$. The positive and negative answers, respectively, define the left and 
right “sons” $t_L(s)$ and $t_R(s)$ corresponding to the given split $s$. 

The search for the best split for node $t$ is (or may be reduced to) 
maximization of a kind of statistic $\phi(s,t)$, the splitting criterion, over 
the set $\bigcup_k S({}^kX, t)$ of splits based on all predictors. 



3. Auxiliary statistics: Minimization of handling raw data 

A naive way of computing the splitting criterion for candidate splits within 
a node $t$ is to start the calculation for each split “from scratch” using the 
raw data that “belong” to $t$. Many splitting criteria nevertheless allow 
utilization of partial results, so called auxiliary statistics (Klaschka and 
Antoch, 1997). By the auxiliary statistic for a set of cases $\mathcal{L}$ we 
mean such a structured set $\alpha(\mathcal{L})$ of numbers (a vector, a matrix, 
a list of vectors and/or matrices, etc.) that: 

- it is computationally cheap, using the auxiliary statistics 
$\alpha(\mathcal{L}_1)$ and $\alpha(\mathcal{L}_2)$ available for disjoint sets 
of cases $\mathcal{L}_1$ and $\mathcal{L}_2$, to calculate 
$\alpha(\mathcal{L}_1 \cup \mathcal{L}_2)$; 

- analogously, for $\mathcal{L}_2 \subset \mathcal{L}_1$, it is easy to 
calculate $\alpha(\mathcal{L}_1 \setminus \mathcal{L}_2)$ using 
$\alpha(\mathcal{L}_1)$ and $\alpha(\mathcal{L}_2)$; 

- it is computationally cheap to calculate the splitting criterion $\phi(s,t)$ 
from the auxiliary statistics $\alpha(\mathcal{L}(t_L(s)))$ and 
$\alpha(\mathcal{L}(t_R(s)))$ for the left and right “sons” $t_L(s)$ and 
$t_R(s)$ of the node $t$. 

Some simple examples of auxiliary statistics are as follows. 

• In classification trees (Breiman et al., 1984), the vector of frequencies 
of those cases from a set of cases $\mathcal{L}$ that are a priori classified 
into the given classes $C_1, \ldots, C_J$ can be used as an auxiliary statistic 
assigned to $\mathcal{L}$. 

• The splitting criterion $\phi(s,t)$ in least squares regression trees (ibid.) 
with dependent variable $Y$ can be expressed as 

$\sum_{l \in \mathcal{L}(t_L(s))} (y_l - \bar{y}_L)^2 + \sum_{l \in \mathcal{L}(t_R(s))} (y_l - \bar{y}_R)^2$, 

where $\bar{y}_L$ and $\bar{y}_R$ are the means of $Y$ within 
$\mathcal{L}(t_L(s))$ and $\mathcal{L}(t_R(s))$, respectively. Clearly, the 
triple consisting of $|\mathcal{L}|$, $\sum_{l \in \mathcal{L}} y_l$ and 
$\sum_{l \in \mathcal{L}} y_l^2$ can be used as an auxiliary statistic for 
$\mathcal{L}$. 

• When the model of multiple linear regression Y = X (3 e is fitted 

within nodes by the least squares method and the splitting criterion is 
defined as 

(f>{s,t) = B.Ss[c{tL{s))) + RSs{^C{tR{s))), 

RSS{C) being the residual sum of squares in £ (compare Ciampi, 1993), 
then the triple consisting of the matrix X'X, the vector X'Y^ and 
the scalar Y'Y can be used as an auxiliary statistic. It follows from 
the fact that the residual sum of squares can be expressed as RSS = 
Y'Y - {Y'X){X'X)-{X'Y). 
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Analysis of more sophisticated auxiliary statistics utilizing QR- factori- 
zation, following up previous results of Antoch and Ekblom (1995), is in 
progress. 
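
The three defining properties can be illustrated on the least squares example. The following is only a minimal sketch, not code from the paper: the class name `LsStat` and its methods are hypothetical, and the statistic is the triple (number of cases, sum of y, sum of squared y) described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LsStat:
    """Hypothetical auxiliary statistic (|C|, sum of y, sum of y^2) for a set of cases C."""
    n: int = 0
    s1: float = 0.0
    s2: float = 0.0

    @classmethod
    def from_raw(cls, ys):
        # The only place where the raw data are touched.
        ys = list(ys)
        return cls(len(ys), sum(ys), sum(y * y for y in ys))

    def __add__(self, other):
        # Statistic of the union of two disjoint sets of cases.
        return LsStat(self.n + other.n, self.s1 + other.s1, self.s2 + other.s2)

    def __sub__(self, other):
        # Statistic of C1 \ C2 when C2 is a subset of C1.
        return LsStat(self.n - other.n, self.s1 - other.s1, self.s2 - other.s2)

    def sse(self):
        # Sum of squared deviations from the mean, from the triple alone.
        return self.s2 - self.s1 ** 2 / self.n if self.n else 0.0


def criterion(left, right):
    # Least squares criterion for a split into two sons.
    return left.sse() + right.sse()


left = LsStat.from_raw([1.0, 2.0, 3.0])
right = LsStat.from_raw([10.0, 12.0])
whole = left + right
```

Once `whole` and `right` are known, `whole - right` recovers the statistic of the left son without revisiting the raw data, which is the point of the second property.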

The usage of auxiliary statistics depends on the type of predictor: 

• For a categorical predictor X, the set of cases $\mathcal{C}(t)$ is partitioned into 
subsets $\mathcal{C}_x = \{l \in \mathcal{C}(t);\ x_l = x\}$, $x \in X(t)$, and the auxiliary statistics 
$\alpha(\mathcal{C}_x)$ are computed from the raw data for all these subsets. For any 
split s based on X, the auxiliary statistics $\alpha(\mathcal{C}(t_L(s)))$ and $\alpha(\mathcal{C}(t_R(s)))$ 
are calculated by "combining" the pre-calculated auxiliary statistics 
$\alpha(\mathcal{C}_x)$, $x \in X(t)$. The splitting criterion $\phi(s, t)$ is then calculated 
using $\alpha(\mathcal{C}(t_L(s)))$ and $\alpha(\mathcal{C}(t_R(s)))$. 

• For a numerical predictor X, especially when many cutpoints are used 
within node t, we propose another procedure. Splits are processed in 
the natural order, so that the cutpoints increase. For the leftmost cutpoint 
$c_{min}$, the auxiliary statistics $\alpha(\{l \in \mathcal{C}(t);\ X < c_{min}\})$ and 
$\alpha(\{l \in \mathcal{C}(t);\ X > c_{min}\})$ are computed from the raw data. For any other 
cutpoint c, the auxiliary statistics $\alpha(\{l \in \mathcal{C}(t);\ X < c\})$ and 
$\alpha(\{l \in \mathcal{C}(t);\ X > c\})$ are recalculated from: 

- the stored auxiliary statistics for the preceding cutpoint $c' < c$, 
namely $\alpha(\{l \in \mathcal{C}(t);\ X < c'\})$ and $\alpha(\{l \in \mathcal{C}(t);\ X > c'\})$, and 

- $\alpha(\{l \in \mathcal{C}(t);\ c' < X < c\})$, which is computed from scratch. 
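
The procedure for a numerical predictor can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `best_numeric_split` is hypothetical, the auxiliary statistic is the least squares triple (count, sum, sum of squares), cutpoints are midpoints between consecutive distinct values, and the sketch minimises the within-son sum of squares (equivalently, maximises its negative).

```python
def best_numeric_split(xs, ys):
    """Return (cutpoint, value) minimising the within-son sum of squares."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    tot1 = sum(y for _, y in pairs)
    tot2 = sum(y * y for _, y in pairs)

    def sse(m, s1, s2):
        # Sum of squared deviations from the mean, from (count, sum, sum of squares).
        return s2 - s1 * s1 / m

    best = None
    left_n, left_1, left_2 = 0, 0.0, 0.0
    for i in range(n - 1):
        x, y = pairs[i]
        # Cases lying between the previous and the current cutpoint are moved
        # to the left son by updating the stored statistics.
        left_n += 1
        left_1 += y
        left_2 += y * y
        next_x = pairs[i + 1][0]
        if next_x == x:
            continue  # no cutpoint between equal predictor values
        cut = (x + next_x) / 2.0
        value = (sse(left_n, left_1, left_2)
                 + sse(n - left_n, tot1 - left_1, tot2 - left_2))
        if best is None or value < best[1]:
            best = (cut, value)
    return best


best = best_numeric_split([1, 2, 3, 10, 11], [1.0, 1.0, 1.0, 5.0, 5.0])
```

Apart from the initial sort and the two totals, every case is handled once, whereas a computation "from scratch" would revisit all cases of the node for each cutpoint.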

Auxiliary statistics are, of course, available only for some splitting 
criteria. We may also meet a situation "somewhere in between". For example, 
in the least absolute deviation regression trees (Breiman et al., 1984) the 
median and the sum of absolute deviations from the median "nearly" have the 
properties of an auxiliary statistic. Recalculating these statistics for the 
union of two data subsets requires some handling of the raw data, but much 
less than a complete calculation from scratch. 

Finally, we should mention that the auxiliary statistics represent a natural 
idea rather than a "trick". Therefore, it is quite possible that they are 
used in many existing tree-growing programs. However, we have not found 
any support for such a hypothesis in the literature. 



4. Fast splitting algorithm: Omitting calculations for 
splits known to be nonoptimal 

Undoubtedly, it is useful to recognize in time that some splits are 
nonoptimal, so that the related calculations can be omitted. 

Breiman et al. (1984) present a classical example of such a utilization 
of knowledge. In the two-class classification problem in CART, for a broad 
class of splitting criteria, the search for the best split among the $2^{k-1} - 1$ 
splits based on a k-valued categorical predictor X can be confined to 
a subset of $k - 1$ splits only. The values $x_1, \dots, x_k$ of X can be ordered into 
a sequence $x_{(1)}, \dots, x_{(k)}$ so that the relative frequencies of cases classified 
into the first class (of the two) among the cases with $X = x_{(i)}$, 
$i = 1, \dots, k$, are nondecreasing. Only those splits that for some j put the cases 
with $X \in \{x_{(1)}, \dots, x_{(j)}\}$ into one node (and the rest into the other 
node) are candidates for optimality (see Theorem 4.5 in Breiman et al., 
1984). A similar result concerning the least squares regression trees in 
CART can be found in Breiman et al. (1984), Proposition 8.16. 
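
The reduction from $2^{k-1} - 1$ to $k - 1$ candidate splits can be checked on a small example. The sketch below is illustrative only: the function names and the data are hypothetical, the criterion is the size-weighted Gini impurity of the two sons (to be minimised, which is equivalent to maximising the Gini decrease), and each value of X is described by its pair of class counts.

```python
from itertools import combinations


def weighted_gini(counts, left):
    """Size-weighted Gini impurity of the two sons; smaller is better."""
    right = [v for v in counts if v not in left]
    total = 0.0
    for side in (left, right):
        n1 = sum(counts[v][0] for v in side)
        n2 = sum(counts[v][1] for v in side)
        if n1 + n2:
            total += 2.0 * n1 * n2 / (n1 + n2)  # n * 2p(1 - p)
    return total


def best_exhaustive(counts):
    """Best value over all 2^(k-1) - 1 splits (first value fixed on the left)."""
    values = sorted(counts)
    first, rest = values[0], values[1:]
    return min(
        weighted_gini(counts, {first, *extra})
        for r in range(len(rest))
        for extra in combinations(rest, r)
    )


def best_ordered(counts):
    """Best value over the k - 1 splits that respect the class-1 frequency order."""
    order = sorted(counts, key=lambda v: counts[v][0] / sum(counts[v]))
    return min(weighted_gini(counts, set(order[:j])) for j in range(1, len(order)))


# value of X -> (cases in class 1, cases in class 2); made-up numbers
counts = {"a": (8, 2), "b": (1, 9), "c": (5, 5), "d": (9, 1), "e": (3, 7)}
exhaustive = best_exhaustive(counts)
ordered = best_ordered(counts)
```

With five values the exhaustive search evaluates 15 splits and the ordered search only 4, yet both reach the same optimum.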

Mola and Siciliano (1996, 1997) have suggested (and implemented in several 
programs) another result of that kind, the so-called fast splitting 
algorithm. Its idea is as follows. For various splitting criteria $\phi$ it is 
possible to define a statistic $\psi(X, t)$ that is a common upper bound for the 
values $\phi(s, t)$ for all splits s based on a single predictor X within a node t. 
When we already have a split $s_0$ such that the inequality $\phi(s_0, t) > \psi(X, t)$ 
holds for a predictor X, we know that none of the splits in $S(X, t)$ can be 
the best one for the given node; therefore we can omit the calculations of the 
splitting criterion for all these splits. 

The algorithm first sorts the predictors into a sequence ${}^{(1)}X, {}^{(2)}X, \dots$ 
so that the corresponding sequence $\psi({}^{(1)}X, t), \psi({}^{(2)}X, t), \dots$ is 
nonincreasing. Then all the splits based on ${}^{(1)}X$, ${}^{(2)}X$, etc. are evaluated in 
turn, until for some j the maximum $\phi$-value found among the splits based on 
${}^{(1)}X, \dots, {}^{(j)}X$ is higher than $\psi({}^{(j+1)}X, t)$. Since it is then clear that 
none of the yet unevaluated splits based on ${}^{(j+1)}X, {}^{(j+2)}X, \dots$ can be 
optimal, the search can be stopped, and the best split found so far is the 
best one for the given node. 

The algorithm should be applied to categorical predictors with at least 
three values, and to numerical (ordered) predictors with only a few (but 
more than two) values. (It is evident that for a binary predictor X, which 
defines just one split, the possible computational cost savings related to 
$\phi(s, t)$ are lost in the calculation of $\psi(X, t)$. For numerical predictors with 
many values the $\psi$-values become computationally too costly.) 

The upper bound $\psi$ mentioned above is very easy to construct for 
a broad class of splitting criteria (see Mola and Siciliano, 1998, and 
Klaschka and Antoch, 1997), covering, e.g., those of: 

- the least squares regression and the least average distance regression in 
CART (Breiman et al., 1984); 

- classification trees with the splitting criterion based on the Gini 
coefficient (ibid.); 

- the methods fitting in each node of the tree a specific multivariate 
model, using the least squares method or the maximum likelihood 
method (see, e.g., Ciampi, 1993). 

Maximization of $\phi(s, t)$ is often equivalent to maximization of the criterion 
$\phi'(s, t) = G(\mathcal{C}(t_L(s))) + G(\mathcal{C}(t_R(s)))$, where G is a "goodness of fit 
function" such that $G(\mathcal{C}_1 \cup \mathcal{C}_2) \le G(\mathcal{C}_1) + G(\mathcal{C}_2)$ for any pair of disjoint sets 
of cases $\mathcal{C}_1$ and $\mathcal{C}_2$. An upper bound of the splitting criterion $\phi(s, t)$ over 
$S(X, t)$ is then obtained in the following way: The set of cases $\mathcal{C}(t)$ is 
partitioned into subsets $\mathcal{C}_x = \{l \in \mathcal{C}(t);\ x_l = x\}$, $x \in X(t)$. The sum 
of $G(\mathcal{C}_x)$ over $x \in X(t)$ is a common upper bound of the values $\phi'(s, t)$ for 
$s \in S(X, t)$, and it can usually be "transformed back" into an analogous 
bound for the original criterion $\phi$. 

It is often possible and, moreover, highly desirable to combine the use 
of auxiliary statistics and of the fast splitting algorithm, utilizing the same 
auxiliary statistics for calculation of both the upper bound and of the 
splitting criterion $\phi$ (see the next Section). 
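
The combination of both enhancements can be sketched for least squares regression with categorical predictors. The code is a hedged illustration, not the implementation of Mola and Siciliano or of the present authors: all function names are hypothetical, the goodness of fit is taken as G = -(within-set sum of squares), which satisfies the inequality above, and the per-value triples (count, sum, sum of squares) serve both for the bound and for the criterion.

```python
from itertools import combinations


def stat(ys):
    """Auxiliary statistic (n, sum y, sum y^2) computed from raw responses."""
    return (len(ys), sum(ys), sum(y * y for y in ys))


def combine(stats):
    """Statistic of a union of disjoint sets: component-wise sum."""
    n = s1 = s2 = 0
    for m, a, b in stats:
        n, s1, s2 = n + m, s1 + a, s2 + b
    return (n, s1, s2)


def goodness(s):
    """G(C) = -(within-set sum of squares), from the triple only."""
    n, s1, s2 = s
    return -(s2 - s1 * s1 / n) if n else 0.0


def fast_best_split(predictors, ys):
    """predictors: {name: list of categorical values}; returns (best, n_evaluated)."""
    aux = {}
    for name, xs in predictors.items():
        groups = {}
        for x, y in zip(xs, ys):
            groups.setdefault(x, []).append(y)
        aux[name] = {x: stat(g) for x, g in groups.items()}
    # Upper bound psi(X, t): sum of G over the finest partition by values of X.
    psi = {name: sum(goodness(s) for s in a.values()) for name, a in aux.items()}
    best, evaluated = None, 0
    for name in sorted(aux, key=psi.get, reverse=True):
        if best is not None and best[0] > psi[name]:
            break  # neither this predictor nor any later one can do better
        values = sorted(aux[name])
        first, rest = values[0], values[1:]
        for r in range(len(rest)):
            for extra in combinations(rest, r):
                left = (first,) + extra
                right = [v for v in values if v not in left]
                value = (goodness(combine(aux[name][v] for v in left))
                         + goodness(combine(aux[name][v] for v in right)))
                evaluated += 1
                if best is None or value > best[0]:
                    best = (value, name, left)
    return best, evaluated


ys = [1.0] * 6 + [9.0] * 3
predictors = {"X1": list("aaabbbccc"), "X2": list("uvwuvwuvw")}
best, evaluated = fast_best_split(predictors, ys)
```

In this made-up example the three splits on `X1` are evaluated and the three splits on `X2` are skipped, because the best value found already exceeds the bound for `X2`.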

5. Computational cost savings 

The effect of the enhancements described in Sections 3 and 4 may be 
estimated both theoretically and empirically. As regards the empirical 
approach, Siciliano and Mola (1996) described a part of their empirical 
experience with the performance of the fast splitting algorithm. As concerns 
the theoretical approach, Klaschka and Antoch (1997) outlined some partial 
theoretical considerations regarding the effect of the fast splitting algorithm. 
In this paper, we present more recent and more complete theoretical 
results. They highlight the mutually complementary roles of the auxiliary 
statistics and the fast splitting algorithm, and show the virtues of the joint 
application of both enhancements. 

Our analysis will deal with a single categorical predictor X. The overall 
computational cost savings during the whole process of tree-growing are, 
of course, composed of these partial savings. We shall utilize a number of 
simplifying assumptions, which are listed below. 

• We assume that X takes on the same number k of values in each of 
those nodes of the tree where the search for the best split takes place. 
Thus, there are $2^{k-1} - 1$ candidate splits based on X in any of these 
nodes. 

• We assume that within a node with n cases: 

- the cost of all the auxiliary statistics $\alpha(\mathcal{C}_x)$, $x \in X(t)$, computed 
from the raw data is cn, where c is a constant; 

- the cost of calculation of one value of the splitting criterion $\phi$, and/or 
the cost of calculation of the upper bound $\psi$, from the auxiliary 
statistics is equal to cb; 

- the cost of calculation of one value of the splitting criterion $\phi$, and/or 
the cost of calculation of the upper bound $\psi$, computed from scratch 
is c(n + b). 

• We neglect the cost of recalculation of the auxiliary statistics for $\mathcal{C}(t_L)$ and 
$\mathcal{C}(t_R)$ from the auxiliary statistics $\alpha(\mathcal{C}_x)$, $x \in X(t)$. 

• By $\nu$ we denote the mean number of cases in those nodes where the 
search for the best splits takes place. Notice that: 

- the parameter $\nu$ grows with the total number of cases in the data set; 

- the larger the number of nodes allowed to exist by the stopping rule, 
the smaller is $\nu$; 

- for a given total number of cases and a given number of nodes, the 
stronger the tendency (of the method and/or of the data) to prefer 
extremely asymmetrical ("end-cut") splits, the higher is the 
parameter $\nu$. 

• The fast splitting algorithm treats the splits based on X within a node 
in one of the two following manners. 

1. They are detected as nonoptimal by the fast algorithm, and the 
calculations of the splitting criterion for these splits are omitted. 

2. The splitting criterion for all these splits must be calculated. 

We assume that the choice between these two alternatives is random, 
and moreover, that the latter one has the same probability $\pi(X)$ in 
every node. 

Under the above assumptions, the following formula for the relative 
computational cost savings (RS) attributable to the use of the auxiliary 
statistics (AUX) can be derived. (Subscript PLAIN stands for unenhanced 
computation.) 

$$RS_{AUX} = \frac{cost_{PLAIN} - cost_{AUX}}{cost_{PLAIN}} \approx \left(1 - \frac{1}{2^{k-1} - 1}\right) \frac{1}{1 + b/\nu}.$$

Thus, as we can see, the effect of using auxiliary statistics is increased by: 

- a large number of values of categorical predictors; 

- a high overall number of cases in the data set; 

- a small number of terminal nodes (before possible pruning); 

- a tendency to very asymmetric ("end-cut") splits; 

- simple calculation of the splitting criterion from the auxiliary statistics. 

Under the same assumptions, the relative computational cost savings 
due to the application of the fast splitting algorithm (FAST) without 
using the auxiliary statistics can be expressed as 

$$RS_{FAST} = \frac{cost_{PLAIN} - cost_{FAST}}{cost_{PLAIN}} \approx 1 - \pi(X) - \frac{1}{2^{k-1} - 1}.$$

The effect of the fast splitting algorithm grows, as we can see, with: 

- a high number of values of categorical predictors; 

- the probability that the fast splitting algorithm omits calculation of the 
splits related to X. 

Notice that the global savings (for all predictors) obviously depend on how 
the $\pi$-values are distributed in the set of all the predictors used. A large 
amount of calculation is saved when there are only a few splits close to the 
optimum and many "much worse" splits. 

Finally, the savings due to the combined use of the auxiliary statistics 
and the fast splitting algorithm (AUX,FAST) can be evaluated as 

$$RS_{AUX,FAST} = \frac{cost_{PLAIN} - cost_{AUX,FAST}}{cost_{PLAIN}} \approx \left(1 - \frac{1}{2^{k-1} - 1}\right) \frac{1}{1 + b/\nu} + \left(1 - \pi(X) - \frac{1}{2^{k-1} - 1}\right) \frac{1}{1 + \nu/b}. \tag{1}$$

It is easy to see that $RS_{AUX,FAST} = RS_{AUX} + RS_{FAST} / (1 + \nu/b)$. From 
(1) it is clear that the factors behind the parameters b and $\nu$ can eliminate 
the effects of either the auxiliary statistics or of the fast splitting algorithm, 
but not both. 
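
The three savings can be reproduced numerically from the cost assumptions of this section. The sketch below is only an illustration: the function names are hypothetical, a single node with n cases is considered (so n plays the role of $\nu$), and the expected costs of the fast algorithm are this sketch's reading of the assumptions (one bound computation, then all splits with probability $\pi$).

```python
def costs(k, n, b, pi, c=1.0):
    """Expected costs (PLAIN, AUX, FAST, AUX+FAST) for one node with n cases."""
    splits = 2 ** (k - 1) - 1
    plain = splits * c * (n + b)                 # every split from scratch
    aux = c * n + splits * c * b                 # statistics once, then cheap splits
    fast = c * (n + b) * (1 + pi * splits)       # bound from scratch, splits with prob. pi
    aux_fast = c * n + c * b * (1 + pi * splits) # statistics, cheap bound, cheap splits
    return plain, aux, fast, aux_fast


def relative_savings(k, n, b, pi):
    """Relative savings (RS_AUX, RS_FAST, RS_AUX,FAST) with respect to PLAIN."""
    plain, aux, fast, aux_fast = costs(k, n, b, pi)
    return tuple(1.0 - x / plain for x in (aux, fast, aux_fast))


k, n, b, pi = 5, 100.0, 10.0, 0.2
rs_aux, rs_fast, rs_aux_fast = relative_savings(k, n, b, pi)
```

Under these assumptions the relation between the three savings stated above holds exactly for a single node; the approximation sign in the formulas presumably reflects the averaging over nodes.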

In conclusion, we would like to stress that the two enhancements 
described in Sections 3 and 4 together cover well the broad spectrum of 
situations where the computational cost of tree-growing may become 
high. Therefore, their joint use is highly desirable whenever it is possible. 

Summary : Adding linear combination splits to decision trees allows 
multivariate relations to be expressed more accurately and succinctly than 
univariate splits alone. Linear programming is proposed to determine an 
oblique hyperplane which distinguishes two sets. This formulation yields a 
straightforward way to treat missing values. A computational comparison of 
this linear programming approach with classical univariate split algorithms 
demonstrates the interest of the method. 

Key words : Oblique decision tree, missing values, linear programming. 



Introduction 

Classification and decision trees are already part of current research, 
and many methods still have to be explored [5]. Unlike classical 
methods, which produce univariate trees, linear programming is proposed 
here to generate oblique decision trees¹. 

For many years, many methods have been proposed to create ODT, but finding 
the best multivariate hyperplane is an NP-complete problem. Murthy, Kasif and 
Salzberg [6] have proposed OC1, a solution based on impurity measures and a 
perturbation algorithm. Other methods, proposed by Mangasarian, Setiono and 
Wolberg [4] or Bennet [1], induce ODT using linear programming, but these 
methods find the optimal linear discriminants by optimising specific goodness 
measures given by the authors. 

Instead of an analytic approach, geometrical results of linear 
programming are used here, and no special measures are needed. Only data sets 
containing two classes can be treated for the moment, but an original solution 
to the problem of missing values is proposed. This makes it possible to 
work with a medical data set and to compare the performance of this algorithm 
with C4.5 [7], a standard builder of univariate decision trees. 



* {gmichel,lambert,cremilleux}@info.unicaen.fr 

* amar@baclesse.fr 

¹ Trees using oblique hyperplanes to partition data are called oblique decision trees and noted ODT. 

1 Linear programming : geometrical approach 

Two principal results of linear programming are used in this 
work: the Simplex Algorithm and the Duality Theorem. Linear programming is 
an algebraic tool, so the data set has to be translated into an algebraic form. 

Assume, for the moment, that the data set contains two classes, that all data 
are numeric and that there are no missing values. 

1.1 Global approach 

The data set can be projected into a Euclidean space, noted E, of dimension n 
(n is the number of characteristics describing each datum, and each datum 
becomes one point in E), so that each class is represented by a cloud of 
points. The aim of learning, in this case, is to separate the two classes (or 
the two clouds of points): in the case of Figure 1, the hyperplane D divides 
the x-class from the y-class and makes it possible to decide to which class new 
points belong. As linear programming produces linear borderlines, dividing 
the two clouds of points and dividing their convex covers are equivalent. 

Let $C_1 = \{E_1, \dots, E_p\}$ and $C_2 = \{F_1, \dots, F_q\}$, with $E_i$ in E for all i in 
[1..p]. The convex covers of $C_1$ and $C_2$, noted Con($C_1$) and Con($C_2$), are 
obtained by introducing convex notions. The intersection of Con($C_1$) and 
Con($C_2$) is empty (and a linear borderline can be found) if and only if 
System 1 has no solution. Otherwise, another result of linear programming 
gives a method for obtaining good results. 

1.2 Tools and methods 

The Simplex Algorithm and the Duality Theorem [3] are the main tools used 
in these two situations. 

1.2.1 A borderline exists 

System 1 : Intersection of convex covers. 

$$\sum_{i=1}^{p} \lambda_i E_i = \sum_{j=1}^{q} \mu_j F_j, \qquad \sum_{i=1}^{p} \lambda_i = 1, \qquad \sum_{j=1}^{q} \mu_j = 1, \qquad \text{with} \quad \forall i \in [1..p]\ \lambda_i \ge 0, \quad \forall j \in [1..q]\ \mu_j \ge 0$$

System 2 : System 1 dual 

$$\exists\, a, b : \forall i \in [1..p], \forall j \in [1..q] \qquad a^t E_i < b, \qquad a^t F_j > b$$

First, consider the case in which a borderline dividing the convex covers 
exists (i.e. the situation looks like Figure 1). 

A linear borderline (i.e. a vector a) exists if and only if System 2 has a 
solution. By noticing that System 2 is the dual of System 1 and using the 
Duality Theorem, the following result can be proved: if System 1 has no 
solution, the base of the dual is a solution of System 2. This base, easily 
extractable from System 1, gives, for each dimension, the coefficient of the 
vector a (and, of course, the equation of the hyperplane D). 

1.2.2 No borderline exists 

Most of the time, however, no hyperplane dividing the classes exists and the 
dual has no solution. See Figure 2 for an example of this kind of 
situation. 



The Simplex Algorithm returns a proof of the non-existence of hyperplanes by 
giving a small set of points belonging to both classes. Figure 2 shows easily 
that Con($C_1$) $\cap$ Con($C_2$) $\neq \emptyset$, and one proof is given by the points 
$\{x_2, x_3, y_1, y_2\}$ (because $[x_2, x_3] \cap [y_1, y_2] \neq \emptyset$). In this case, the 
idea is to take out one point from this set and to consider the new convex 
covers again. 

Logically, after m iterations, two distinct convex covers are obtained 
and the results of § 1.2.1 may be applied. 

These two remarks lead to the following algorithm. 

1.3 A trivial algorithm 

Let $C_1$ and $C_2$ be two sets of points in E. The procedure used to induce a 
split that divides the two sets of points is given as Algorithm 1. The function 
that chooses one point in the intersection set is very important and not easy 
to define. For example, in Figure 2, it is more interesting to extract $x_2$ 
than $x_3$, $y_1$ or $y_2$. For the moment, the choice function is trivial and 
chooses a point randomly. It is certainly easy to find a better solution, and 
this point should be studied in the future. 

Algorithm 1 : Application 

    Build system 
    While Indivisible do 
        Find intersection 
        Choose point 
        Extract point from system 
    end while 
    Extract hyperplane from system 

2 Missing values treatment 

In this representation of the data set, there is no place for missing values, 
whereas it is very important to consider this kind of trouble, which is highly 
present in data extracted from the real world. 

2.1 Linear programming interpretation of missing values 

Algorithms usually propose [2] [8] to replace missing values by values 
extracted from the data set (e.g. the median value of the attribute) or to 
take a probabilistic approach [7]. 

The linear programming approach allows another treatment. 
Instead of fixing missing values, they are replaced by variables. Limits are 
given to those variables to be sure that they take logical values. This means 
that an expert has to define a maximum value and a minimum value for every 
dimension in which there are missing values. Sometimes those values can be 
extracted from the data base by finding the actual maximum and minimum, if 
the data base is representative enough. 

2.2 Geometric interpretation of missing values 

To understand the approach, it is useful to have a geometrical 
interpretation of this operation. A datum having p missing values 
is replaced by a hypercube of dimension p (i.e. a hypercube with 
$2^p$ vertices). Consider the following case: A = (?, α) and B = (β, ?), 
and have a look at Figure 3. 

The constraints given by the linear programming approach are stronger 
than those given by the classical approaches, and the choice of hyperplanes 
is more limited. It is even possible to have situations in which hyperplanes 
exist in the first case whereas the intersection between the two hypercubes 
is not empty (no hyperplane can exist in this condition). This proves that 
the approaches are not equivalent. The algorithm given in § 1.3 offers an 
elegant solution to this trouble because bad points are eliminated. 
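
The geometric picture can be made concrete by enumerating the vertices of the hypercube that replaces an incomplete datum. This is only an illustrative sketch and not the authors' linear programming formulation, which introduces bounded variables instead: the function name `box_vertices`, the use of `None` for a missing value and the example bounds are all assumptions.

```python
from itertools import product


def box_vertices(point, bounds):
    """Vertices of the hypercube replacing a datum with missing components.

    point  : list of numbers, with None marking a missing value
    bounds : list of (minimum, maximum) pairs, one per dimension
    """
    choices = [
        (value,) if value is not None else bounds[i]
        for i, value in enumerate(point)
    ]
    return [tuple(vertex) for vertex in product(*choices)]


# A datum with p = 2 missing values out of 3 components (made-up bounds).
vertices = box_vertices([None, 2.0, None], [(0, 1), (0, 5), (10, 20)])
```

A hyperplane leaves the whole hypercube strictly on one side exactly when it does so for all of its $2^p$ vertices, so the enumeration gives an equivalent, though exponentially larger, description of the constraints.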

2.3 Generalisation 

The algebraic form is generalised for $C_1$ and $C_2$, two sets of points. Let 
$\Theta$ and $\Phi$ be two sets of n sets. For all i in [1..n], let $\Theta_i$ 
(resp. $\Phi_i$) be the set of indices of variables of $C_1$ (resp. $C_2$) for which 
the i-th component is unknown. For example, r is in $\Theta_i$ if and only if 
$X_r$ is in $C_1$ and $X_{r,i}$ is a missing value. For all i in [1..n], let $m_i$ 
and $M_i$ be the limits for the i-th component in the data base (read from the 
data base or given by an expert of the domain). The system obtained 
(System 3) is linear, and linear programming algorithms may be applied to 
solve it. 

System 3 : Generalised system 

[The body of System 3 is not recoverable from the source. The surviving 
fragments show constraints stated for all i in [1..n] and all r in $\Theta_i$, 
together with the sign constraints $\lambda_i \ge 0$ for all i in [1..p] and 
$\mu_j \ge 0$ for all j in [1..q].] 

Notice that if some components are not numeric, it is always possible to 
transform them. For example, [Small, Medium, Big] becomes [1, 2, 3] and 
[Yellow, Blue, Red] is replaced with three binary components: [0,1] for 
Yellow, [0,1] for Blue and [0,1] for Red. This treatment is important and, at 
first sight, can be seen as a negative point of this method, because it cannot 
be done automatically by algorithms and the knowledge of experts is needed. 
However, when a semantic interpretation of attributes is required, only human 
users can be efficient. 
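
The transformation can be sketched as follows. The attribute names and the function `encode` are hypothetical; the ordinal codes and the three binary components follow the example given in the text, and the choice of which attributes are ordered remains the expert's.

```python
# Ordered attribute: mapped to consecutive integers (expert-supplied order).
SIZE = {"Small": 1, "Medium": 2, "Big": 3}
# Unordered attribute: one binary component per possible value.
COLOURS = ["Yellow", "Blue", "Red"]


def encode(size, colour):
    """Return the numeric components for one (size, colour) pair."""
    return [SIZE[size]] + [1 if colour == c else 0 for c in COLOURS]


encoded = encode("Medium", "Blue")
```

An unordered attribute is deliberately not mapped to [1, 2, 3], since that would impose an order on the colours that a linear borderline would then exploit.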



3 Computational results 

Before giving the results of the experiments that compare the performance 
of this algorithm with C4.5, the data set used has to be described. 

3.1 The Hodgkin's disease 

In this paper, the reported results come from a data set collected by the 
Lymphoma Cooperative Group of the European Organisation for Research and 
Treatment of Cancer (EORTC) and provided by Dr. M. Henry-Amar. 

The data set describes more than 3000 patients treated with various 
protocols. After treatment², the data set currently has 824 entries for the 
learning data and 701 entries for the test data³. The patients, grouped as 
"Favourable" (369 cases) or "Unfavourable" (455 cases), are described 
through 16 continuous attributes and three binary attributes. The learning 
data set contains 330 missing values concentrated on five attributes. 

3.2 Comments about the form of the results 

In this experiment, extrema were extracted from the data base as described 
in § 2.1 and were supervised by Dr. M. Henry-Amar. Hyperplanes are given 
in the form explained in Table 1: for each attribute of the data base, the 
value obtained in the dual is the coefficient of the vector a described in 
§ 1.2.1. 

² Different protocols produce different descriptors for patients, and choices have to be made. 
³ Those data sets are fixed like that for temporal reasons. 





Table 1 : Results form 

    Attribute       Coefficient 
    age             0.0093584466 
    sexe            -0.005338922 
    cbdfus          0.0014359370 
    cbgfus          0.0007233347 
    axdfus          0.0009392872 
    axgfus          0.0003019576 
    medfus          0.1363806657 
    ext             0.0887066056 
    sg              0.2534100762 
    vs              0.0048398000 
    hb              -0.001241328 
    gb              2.485281e-05 
    polfus          0 
    lymfus          0 
    monfus          0 
    plaq            0 
    pa              -0.000123759 
    histforus       0.0077649976 
    ldh             7.475156e-06 
    lambda dual     -0.411007586 
    mu dual         0.4219074193 

In the case of Table 1, the vector a is 

$$a = (0.0093584466,\ -0.005338922,\ \dots,\ 7.475156\text{e-}06,\ 0.0077649976)$$

and let b be 

$$b = \frac{0.0077649976 - (-0.4219074193)}{2}.$$

A point X belongs to "Favourable" if $a \cdot X > b$; otherwise, it belongs to 
"Unfavourable". 
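
The decision rule amounts to a single scalar product followed by a comparison. The sketch below is a minimal illustration with made-up two-dimensional coefficients, not the values of Table 1; the function name `classify` is hypothetical.

```python
def classify(a, b, x):
    """Apply the rule: 'Favourable' if a . x > b, otherwise 'Unfavourable'."""
    score = sum(a_i * x_i for a_i, x_i in zip(a, x))
    return "Favourable" if score > b else "Unfavourable"


# Hypothetical hyperplane in two dimensions.
a = [0.5, -0.25]
b = 0.1
labels = [classify(a, b, x) for x in ([1.0, 1.0], [0.0, 1.0])]
```

An attribute whose coefficient is zero does not influence the score, which is why such attributes can be read off the table as unnecessary for the classification.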



Results are given in a form that allows a semantic interpretation. For 
example, the attribute polfus is not needed for classification because its 
coefficient is null, and the large value of the coefficient of age means that 
the youngest are more concerned by the disease than the oldest. This 
information is well known to medical experts, but other hypotheses are 
confirmed in this way. 

3.3 Results of the linear programming classifier 



To obtain results, just three oblique hyperplanes are extracted (which 
means that the work space is divided into four regions by three linear 
borderlines), whereas C4.5 uses more than 20 rules. 

[Results table: layout not recoverable from the source. Surviving entries: 
headings "Class 1" and "Total"; values 88 ?, 94.3%, 85.3%, 87.4%, 89.6%.] 

The results obtained are below those obtained by C4.5, but they are very 
interesting. With few hyperplanes (three), the results obtained on the test 
data base show that the method described here is valid even on a data base 
extracted from the real world (the C4.5 results show the nontrivial 
partition of the Hodgkin's disease data base). 

Notice that the speed of linear programming algorithms, which are 
particularly well adapted to this kind of situation, makes it possible to 
obtain results very easily on a common computer. 

4 Conclusions 

This paper has described a linear programming method to construct linear 
borderlines speedily and easily. A new way to exploit linear programming in 
classification has been presented and an original treatment of missing values 
has been introduced. The proposed implementation was tested on a consistent 
data set extracted from the medical domain and interesting results were 
obtained: good hyperplanes that generalise very well were found. 

Better results may be obtained by trying the following ideas: build 
complete decision trees with linear borderlines, define a clever choice 
function (§ 1.3), introduce the extracted elements in another order, use some 
pruning methods, test the algorithm with cross validation and on other data 
sets, and generalise to multi-class decision trees. These extensions are 
currently being explored. 
Abstract: The recent interest in tree-based methodologies shown by more and 
more researchers and users has stimulated an interesting debate about the 
implemented software and the desired software. A characterisation of the 
available software suitable for the so-called classification and regression 
trees methodology is described. Furthermore, the general properties that an 
ideal programme in this domain should have are defined. This makes it 
possible to emphasise the particular methodological aspects that a general 
segmentation procedure should achieve. 

1. Introduction 

During the last ten years, interest in tree-based methodologies in statistics 
has increased considerably. This is not surprising, since a tree diagram, 
together with the information about its construction and node characteristics, 
provides a powerful summary of the information contained in a data set. 
Typically, most of the known algorithms provide either the prediction of an 
a priori class (classification trees) or the expected value of a response 
variable (regression trees); at the same time, they can also define a 
tree-structured predictor of the parameter(s) of a distribution, which can be 
incompletely specified and can contain nuisance parameters (generalised 
trees). For basic information and statistical background see the referenced 
papers and monographs; this paper concentrates especially on the 
computational side of the problem. In the following we briefly describe some 
criteria which are, from our point of view, the most important for the proper 
choice of software. Considering these criteria, we analyse the most important 
programs/packages, whether commercial or freely distributed. We concentrate 
(in a concise manner) especially on the weak and strong sides of typical 
representatives of the available software. In this paper, for the sake of 
brevity, we report only the main characteristics of each analysed software 
package, referring to the literature for methodological and technical details. 
In particular, we consider a general framework of the desired (or ideal) 
software. This is basically characterised by main phases that can be sorted in 
different ways depending on the nature of the variables, the adopted criteria, 
the desired output and the complementary use of other approaches such as 
factorial analysis and statistical modelling. 



2. Criteria 

In this section we present the main criteria for the proper choice of the software 
suitable for this type of statistical analysis. First group of parameters covers 
statistical properties, second one basic characteristics related to the programme 
control and the third one concentrates on characteristics related to its 
distribution and overall support both from the developers and both the 
distributors. 

In the first group of characteristics, called basic characteristics related 
to the methods implemented, we consider the characteristics related to the 
statistical aspects of the methodologies and the quality of the outputs. Among 
the other parameters, we analyse which type of data one can deal with, the way 
how missing data are treated in the analysis, the types of methodologies 
implemented, the splitting procedure and splitting rules, the stopping rules and 
the pruning procedures implemented. Particular attention is paid to the 
possibility of automatically classify cases of unknown classification and the 
possibility to follow the same cases along the tree structure defined by the 
classification rule obtained. Concerning the results, we consider characteristics 
like the overall quality of the presentation of results, the numerical and 
graphical presentation of results, the presence or absence of interactive 
graphics, the possibility to export graphics and results in other environments. 

A second important group is called basic characteristics related to the 
programme control. In this group we have enclosed all characteristics related to 
the informatics aspects, that is the way how to use the software. Here are 
considered characteristics as the input/output facilities; the ease of the use; the 
way of control; the error indication; the possibility of recovery after errors and 
the quality of the on-line help. 

The third and last group of characteristics are basic characteristics 
related to the distribution and overall support. In this group we consider, 
among others, these characteristics: 

• the type of operating system(s) the software is available for; 

• whether the package is a stand alone package, part of the larger 
project or part of some general purpose software; 

• whether it is a supported software; 

• whether it is a software under continuous development, 
improvements and innovation; 

• whether the possibility of user configurability is available. 
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3. Analysed software 

Currently, more than ten commercial programs/statistical systems and several 
shareware programs offer (up to some extend) the methodology of classification 
and regression trees. Among them, the following are (from different reasons) of 
special interest, i.e. ID3, AnswerTree, C4.5, CART, CHAID, FIRM, RECPAM, 
SICLA, S-Plus, Spad.S, SYSTAT, TWO-STAGE and UNAIDED. Some of 
these packages (namely IDS, CART, S-Plus, SPAD.S, and Systat) refer to the 
well known and widely used binary segmentation methodologies, while other 
packages refer to some not so widely used binary segmentation procedures 
introduced in the literature by several authors during the last years. 

IDS software performs the AID methodology (Morgan and Sonquist, 1963). 
AID attempts to fill a gap overlooked by regression, discriminant analysis and 
logistic regression by identifying important interactions among the 
predictors. AID also provides a visual display of results in tree diagram form, 
where file segments are represented by the terminal tree branches and are 
accompanied by the associated response rates. While the AID tree analysis 
technique has been shown to discriminate in general between good and bad 
respondents and to detect interactions, it has also been shown to contain a 
considerable amount of error. Limited to dichotomous splits, the AID 
technique's primary inherent problem is a bias toward variables containing many 
categories over variables with few, or just two, levels. AID does not properly 
adjust for the likelihood of finding a dichotomization by chance due to 
sampling variability; it is therefore biased in favour of variables having 
many categories, because there are more ways to dichotomise such variables. 
AID segments also tend not to validate well in rollouts: the response rates of 
the best responding segments tend to be lower (sometimes substantially) than 
predicted. 
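
The size of this bias can be made concrete by counting candidate splits. The 
following minimal sketch is not part of any of the packages discussed; the 
function names are hypothetical. It counts the dichotomizations available for a 
predictor with k categories, assuming that any two-group partition is allowed 
for a nominal predictor and only order-preserving cuts for an ordinal one. 

```python
def n_binary_splits_nominal(k: int) -> int:
    """Ways to divide k unordered categories into two non-empty groups."""
    return 2 ** (k - 1) - 1 if k >= 2 else 0


def n_binary_splits_ordinal(k: int) -> int:
    """Ways to cut k ordered categories into a lower and an upper group."""
    return k - 1 if k >= 2 else 0


# A 2-level predictor offers a single candidate split, whereas a 10-level
# nominal predictor offers 511, so the best of its splits is far more
# likely to look good by chance alone.
for k in (2, 3, 5, 10):
    print(k, n_binary_splits_nominal(k), n_binary_splits_ordinal(k))
```

A selection rule that simply takes the best split over all predictors, without 
adjusting for these counts, will therefore tend to favour many-category 
variables. 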

Following the CHAID module, which has been implemented in SPSS for four years, 
SPSS recently released further algorithms, namely CHAID, Exhaustive CHAID, 
QUEST and CART (most of them are characterised elsewhere in this paper), and 
offers them under the name AnswerTree. An important feature of this module is 
that AnswerTree saves the segment identifier as a new variable for use in 
subsequent analyses in SPSS. Moreover, data for selected segments can easily be 
extracted from a database to generate a mailing list or a new data set using 
the model's decision rule(s). 

The program C4.5 is one of a series of programs written by Quinlan and his 
colleagues in the field of decision trees. The basic idea is to generate all 
possible decision trees that correctly classify the training set and then to select 
the simplest of them. The program is designed for the situation where there are 
many attributes and the training set contains many objects, but where a 
reasonably good decision tree is required without much computation. For the 
evaluation of trees and their validation, the authors do not use classical 
statistical approaches (as do most of the other programs and methodologies 
mentioned in this paper); instead, measures from the field of information 
theory are adopted. 
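
As an illustration of what such information-theoretic measures look like, the 
sketch below computes the entropy of a set of class labels, the information 
gain of a candidate partition, and the gain ratio (the gain divided by the 
entropy of the partition sizes). It is a minimal sketch under the assumption 
that these are the measures of interest; the function names and the toy data 
are hypothetical and are not taken from the C4.5 sources. 

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy reduction when `labels` is partitioned into `groups`."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups if g)
    return entropy(labels) - remainder


def gain_ratio(labels, groups):
    """Information gain normalised by the entropy of the group sizes."""
    n = len(labels)
    split_info = -sum(len(g) / n * log2(len(g) / n) for g in groups if g)
    return information_gain(labels, groups) / split_info if split_info > 0 else 0.0


labels = ["yes"] * 4 + ["no"] * 4
perfect = [["yes"] * 4, ["no"] * 4]            # separates the classes
useless = [["yes", "yes", "no", "no"]] * 2     # leaves both groups mixed
print(information_gain(labels, perfect), gain_ratio(labels, perfect))
print(information_gain(labels, useless), gain_ratio(labels, useless))
```

The normalisation in the gain ratio penalises partitions into many small 
groups, which is one way of counteracting the preference for many-valued 
attributes. 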

CART is software that implements the binary segmentation procedure 
proposed by Breiman et al. (1984). CART is one of the most important 
methodologies for binary segmentation. Two basic ideas characterise the CART 
methodology: the definition of an impurity function at each 
node (and consequently the definition of the splitting criterion based on an 
impurity measure), and the pruning process used to define honest size trees. 
Since this is probably the best known software in this domain, we leave room 
for less known packages. 
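
The first of these ideas can be sketched in a few lines. The example below 
assumes the Gini index as the impurity function and scores a binary split by 
the decrease in impurity from the parent node to the size-weighted average of 
its two children. The function names and data are hypothetical; this is an 
illustration of the principle, not the CART program itself. 

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent, left, right, impurity=gini):
    """Impurity of the parent minus the weighted impurity of its children."""
    n = len(parent)
    p_left, p_right = len(left) / n, len(right) / n
    return impurity(parent) - p_left * impurity(left) - p_right * impurity(right)


parent = ["a"] * 5 + ["b"] * 5
# A split that separates the classes completely removes all impurity.
print(impurity_decrease(parent, ["a"] * 5, ["b"] * 5))
# A less clean split yields a smaller, but still positive, decrease.
print(impurity_decrease(parent, ["a", "a", "a", "b"], ["a", "a", "b", "b", "b", "b"]))
```

At each node the split with the largest decrease is retained; pruning is a 
separate, later phase and is not shown here. 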

Developed by Kass in the late seventies, CHAID represents a statistically 
appropriate approach to tree analysis. It is a chi-squared technique and 
requires all variables to be categorical. Like AID, CHAID is a multivariable 
rather than a multivariate technique, and thus does not require that a full 
multiway table be formed during the analysis. CHAID is well suited as: 

1. A replacement for AID in identifying those segments that differ in response. 

2. A variable selector to screen out extraneous predictor variables. 

3. A category collapsing technique that reduces the number of cells in a full 
multiway table. 

Unlike AID, CHAID is not limited to dichotomous splits. 
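
The chi-squared statistic on which such a technique rests can be sketched as 
follows. The function computes the Pearson statistic and its degrees of 
freedom for a table of counts whose rows are the categories of one predictor 
and whose columns are the response classes. The function name and the toy 
tables are hypothetical, and the sketch assumes that no row or column of the 
table is empty; the category merging and the significance adjustments of the 
full CHAID procedure are not reproduced. 

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of predictor and response.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Rows: predictor categories; columns: responders, non-responders.
print(chi_squared([[10, 20], [20, 40]]))   # same response rate in both rows
print(chi_squared([[20, 0], [0, 20]]))     # response fully determined by the row
```

A predictor whose table gives a large statistic relative to its degrees of 
freedom is a candidate for splitting the node. 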

FIRM, developed by D. M. Hawkins in the late eighties, parallels CART and AID 
in particular. However, it differs from them in several respects, notably in 
varying the number of descendant nodes into which different nodes are split, 
and in using conservative formal (Neyman-Pearson) statistical inference to 
determine when to terminate the analysis at each node. Further differences 
include: 

1. A facility for handling predictive missingness, where the fact of a predictor 
being missing conveys some predictive information about the dependent 
variable. 

2. The predictors are on either the nominal or the ordinal scale. 

3. The dependent variable may be on either the categorical or the interval 
scale of measurement. 

RECPAM is essentially a method of growing generalized trees on the basis of 
generalized information measures, which adds to the traditional RECursive 
partitioning steps of tree-growing an AMalgamation step; this step results in the 
identification of groups that are both homogeneous and distinct with respect to 
the parameters to be predicted. RECPAM is therefore a family of algorithms (and 
programs as well). Perhaps more precisely, it is a research programme which 
aims at applying the principles underlying tree-growing to a variety of 
situations occurring in data analysis in which the information to be presented in 
a tree is not as simple as in the classical CART algorithm. Recent developments 
are RECPAM-GLIM and RECPAM-COX, which are methods for treating, 
respectively, data with a response variable that can be modelled by a 
distribution of the exponential family, and data containing censored survival 
times. 

At the moment there is no module suitable for recursive modelling in SAS. 
However, SAS Inc. recently asked users of its system for their opinion on 
whether and when to include such a module. It therefore seems highly 
probable that a methodology of the Classification and Regression Trees type will 
soon appear within SAS as well. 

SICLA (Systeme Interactive pour CLAssification) is a programme oriented towards 
clustering and classification. In the SICLA software, the classification tree 
procedure DNP (Decoupage Non Parametrique) performs a classification tree 
analysis according to the methodology proposed by Celeux and Lechevallier in 
1982 and based on the Kolmogorov-Smirnov distance. A recent development 
in SICLA concerns the pruning process in binary segmentation: a 
procedure (called bonsai) for searching for optimal trees has been 
implemented. 

SPAD.S (SPAD Segmentation) is an optional module of the package SPAD.N. 
The tree-based methodology implemented in SPAD.S is basically the CART 
methodology of Breiman et al. (1984) with several improvements suggested by 
J. P. Nakache. SPAD.N (and therefore also SPAD.S) is implemented in the MS 
Windows environment, which allows users to use all the input/output facilities 
offered by MS Windows. SPAD.S offers very detailed outputs concerning 
both the splitting and validation phases. 

In the S+ package, a mixture of the CART and RECPAM methodologies is 
implemented. Some interesting enhancements concern the possibility of 
choosing among different pruning procedures. The procedure, however, is not 
well documented in either the manuals or the books describing the use of S+. 

The TREE module of SYSTAT v. 8.0 computes classification and regression 
trees according to the approach suggested by Breiman et al. The TREE procedure 
produces numerical and graphical results (the latter called mobiles). The 
algorithm used for the splitting computations is that of Breiman et al. 
Unfortunately, neither the pruning nor the amalgamation step is implemented, so 
the procedure is of limited practical value. Missing data are eliminated from 
the calculation of the loss function for each split separately. 

The binary segmentation methodology proposed by Mola and Siciliano in 1992 
is implemented in the shareware software called TWO-STAGE. This software 
(running in the MS Windows environment) performs the classical CART procedure 
with the fast splitting algorithm (Mola and Siciliano, 1997). The CATANOVA 
test is used as the stopping rule (see Mola and Siciliano, 1994). 

The UNAIDED software developed by Capiluppi et al. (1997) performs binary 
and non-binary segmentation according to the approach suggested by 
Morgan and Sonquist (1963). All the facilities offered by the MS Windows 
environment are available. The criterion variable can be on any measurement 
scale. In addition, a so-called look-ahead analysis is available. 

4. What should an ideal programme system for 
classification and regression trees methodology look like? 

Taking into account the above considerations about the existing software, let us 
try to imagine an ideal software package for the classification and regression 
trees methodologies. We are involved in a project that aims to realise such 
software. 

As a first point, let us look at some computational aspects of the 
classification and regression trees procedure. It is well known that many steps 
of classification and regression trees programs are the same for all proposed 
procedures, independently of the splitting and pruning algorithm adopted. Let 
us first consider the splitting algorithms. To split a node into left and right 
offspring, some computational effort cannot be "bypassed". We 
refer in particular to the search, at each node, for the set of dichotomous (in 
the case of binary segmentation) questions (splitting variables) generated by 
each predictor, the cross-classification of predictors and response variable, 
the class assignment, the description of the terminal and non-terminal nodes, 
the graphical visualisation of results and so on. In other words, the splitting 
algorithm can be reduced to small phases, which makes it possible 
to offer the user a general tool for growing binary and non-binary trees by 
choosing not only different splitting rules (something that already exists, see 
e.g. Steinberg and Colla, 1984, Capiluppi et al., 1997, AnswerTree, 1997) but 
also different splitting criteria. The researcher would thus have available 
a set of software tools and the possibility of pasting them together to build up 
the desired classification and regression trees splitting procedure. We think 
this is a crucial point: analysing a data set with different splitting 
procedures lets the researcher compare the results of different splitting 
criteria, and a discrepancy in the results could reveal important aspects of 
the data. Another important aspect, related to this point, is to give the 
researcher the possibility of implementing a self-made splitting procedure 
suitable for a particular data structure. Concerning the choice of the best 
tree (using either the pruning approach or the pruning and amalgamation 
approach, and using either statistical tests or user-defined thresholds), many 
phases are common to different methodologies. 
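
The idea of separating the common phases from the splitting criterion can be 
sketched as follows. In this hypothetical example the search over candidate 
thresholds of one numeric predictor is written once, and the criterion is 
passed in as a function, so that a researcher could supply a self-made one. 
The names, the restriction to a single numeric predictor and the two example 
criteria are assumptions made for illustration only. 

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def misclassification(labels):
    """Proportion of cases not belonging to the majority class."""
    n = len(labels)
    return 0.0 if n == 0 else 1.0 - max(Counter(labels).values()) / n


def best_binary_split(values, labels, impurity):
    """Common search phase: try every threshold, score it with `impurity`."""
    n = len(labels)
    parent = impurity(labels)
    best = (None, 0.0)
    for t in sorted(set(values))[:-1]:          # the largest value cannot split
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        decrease = (parent
                    - len(left) / n * impurity(left)
                    - len(right) / n * impurity(right))
        if decrease > best[1]:
            best = (t, decrease)
    return best


values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]
for criterion in (gini, misclassification):
    print(criterion.__name__, best_binary_split(values, labels, criterion))
```

Running the same search with several criteria and comparing the selected 
splits is then a matter of changing one argument. 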

Another important point is the possibility of analysing data sets with 
different statistical methodologies at each step of the tree-growing process, 
i.e. descriptive statistics (some packages do this), visualisation of 
dependency by factorial methods (Mola and Siciliano, 1997) and statistical 
modelling (Mola, Klaschka and Siciliano, 1996). This means being able to 
consider each node of the tree as a sample to be investigated. When we are 
dealing with exploratory trees, this point is very important. 

5. Conclusions and remarks 

Classification and regression trees are present as a methodology in the most 
important general purpose statistical software. This is an important result, 
considering that the methodological developments in this area are 
recent. The efforts of many researchers in developing new and alternative 
algorithms give evidence of the strong interest in segmentation, both for the 
splitting process (Steinberg, 1996; Capiluppi et al., 1997; Klaschka and Mola, 
1998; Klaschka et al., 1998) and for the pruning process (Cappelli et al., 1998). 

On the other hand, we also think that tree-based methods cannot be rigidly 
enclosed in a general purpose system; researchers who like to investigate data 
need flexible tools and not black boxes. It is also true that, for economic 
reasons, only commercial software can guarantee the 
development of software for this domain of research. In the end one question 
still remains open: “To perform classification and regression trees, is 
specialised software better than a procedure embedded in general purpose 
software?” 

Acknowledgements: The work has been partially supported by MURST 40%. 
The author would like to thank Prof. J. Antoch for thorough discussions which 
considerably improved this paper. 



References 

AnswerTree (1997). SPSS-AnswerTree, SPSS Inc., Chicago, IL, USA. 

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification 
and Regression Trees, Wadsworth, Belmont, CA. 

Capiluppi C., Fabbris L. & Scarabello M. (1997). UNAIDED: a PC system for 
Binary and Ternary Segmentation analysis. Book of Short Papers of 
Classification and Data Analysis, Pescara. 

Cappelli, C., Mola, F. & Siciliano, R. (1998). An alternative pruning method 
based on impurity. In Proceedings of COMPSTAT'98, Bristol August (to 
appear) 

CART (1984). CART: A software for classification and regression trees, 
California Statistical Software Inc., Yorkshire Ct. Lafayette, CA, USA. 

CHAID (1996). SPSS-CHAID, SPSS Inc., Chicago, IL, USA. 

Ciampi, A. (1994). Classification and discrimination: The RECPAM approach. 
In COMPSTAT94, Dutter R. & Grossmann W. eds., 139-147, Physica 
Verlag, Heidelberg. 

Clark, L. A. & Pregibon, D. (1992). Tree based models. In Statistical Models in 
S, Chambers J.M. & Hastie T.J. eds., Wadsworth and Brooks/Cole. 

FIRM (1990). FIRM: Formal inference-based recursive modelling by D.M. 
Hawkins (for IBM PC), University of Minnesota, School of Statistics, 
Technical Report 546. 




Hormann, A. et al. (1990). Comparing statistical analysis systems, Statistical 
Software Newsletter, 16, 90-127. 

Klaschka, J. & Mola, F. (1998). Minimization of computational cost in tree- 
based methods by a proper ordering of splits. In Proceedings of 
COMPSTAT'98, Bristol August (to appear). 

Klaschka, J., Siciliano, R. & Antoch, J. (1998). Computational enhancements in 
Tree-Growing Methods, in Proceedings of IFCS'98, Rome (to appear). 

Kass, G.V. (1980). An exploratory technique for investigating large quantities 
of categorical data. Applied Statistics, 29, 119-127. 

Mola, F. & Siciliano, R. (1992). A two-stage predictive splitting algorithm in 
binary segmentation. In COMPSTAT'92, Y. Dodge & J. Whittaker eds., 179- 
184, Physica Verlag, Heidelberg. 

Mola, F. & Siciliano, R. (1994). Alternative strategies and CATANOVA testing 
in TWO-STAGE binary segmentation, New Approaches in Classification 
and Data Analysis, Diday E. et al. eds., 316-323, Springer Verlag. 

Mola, F. & Siciliano, R. (1997). A fast splitting procedure for classification 
trees. Statistics and Computing, 7, 209-216. 

Mola, F. & Siciliano, R. (1997). Visualizing data in tree-structured 
classification. In Proceedings of the IFCS-96: Data Science, Classification 
and Related Methods, Hayashi C. et al eds., Springer Verlag, Tokyo. 

Mola, F., Klaschka, J. & Siciliano, R. (1996). Logistic classification trees. In 
COMPSTAT'96 Proceedings, Prat A. ed., Physica Verlag, Heidelberg. 

Morgan, J.N. & Sonquist, J.A. (1963). Problems in the analysis of survey data 
and a proposal, JASA, 58, 415-434. 

Quinlan, J. R. (1992). C4.5: Programs for machine learning, Morgan 
Kaufmann, New York. 

Safavian, S. R. & Landgrebe, D. (1991). A survey of decision tree classifier 
methodology. IEEE Trans. on Systems, Man, and Cybernetics, 21, 660-674. 

Siciliano, R. & Mola, F. (1997). Multivariate data analysis and modeling 
through classification and regression trees. Proceedings of the Second World 
Conference of IASC, Wegman E. ed., Pasadena CA (in press). 

Siciliano, R. & Mola, F. (1997). Ternary classification trees: a factorial 
approach. In Visualization of Categorical Data, Greenacre M. & Blasius J. 
Eds, Academic Press. 

SICLA (1991). Manuel de l'utilisateur, INRIA, Le Chesnay Cedex, France. 

S-PLUS (1997). S-PLUS Computing Environment v.4.0, StatSci, a division of 
MathSoft Inc., Seattle, WA, USA. 

SPAD.S (1996). Systeme Portable pour l'Analyse des Donnees-Segmentation, 
v.3.2. CISIA, Saint Mande. 

Steinberg, D., & Colla, P. (1995). CART. Salford System, San Diego, CA. 

Steinberg, D. (1996). New developments in CART (Classification & 
Regression Tree) software. In Proceedings of the Fifth Conference of the 
IFCS, Kobe, Japan. 

Venables, W. N. & Ripley, B. D. (1994). Modern Applied Statistics with S-Plus, 
Springer-Verlag, Berlin. 




STABLE: a visual environment 
for the study of multivariate data 

Abstract: STABLE is a European Union funded project that provides a completely 
new approach to statistical programming, analysis and visualisation. The system 
combines the statistical algorithms from the Genstat statistical system with the 
sophisticated visualisation and flexible visual programming environment of the IRIS 
Explorer system to provide a unique environment for exploring statistical data. The 
system will provide a stimulating environment for programming and investigating 
new multivariate techniques, and then powerful techniques for making them 
available in encapsulated IRIS Explorer modules with suitably customized 
interfaces. 

Key words: data visualisation, distributed processing, interactive graphics, 
multivariate graphics, statistical analysis, visual programming. 



1. Introduction 

Visual inspection of data and results is an important part of any statistical analysis, 
particularly with multivariate data. Despite this need for sophisticated graphics, 
however, most statistical software offers little beyond static 2-dimensional displays. 
Furthermore, the statistical systems that do contain facilities for interactive or 
3-dimensional graphics generally provide these in pre-programmed "canned" forms 
that cannot be modified by the end-user (see Gower et al. 1997). 

In contrast, systems are available for engineers and designers that offer considerable 
freedom in the ways in which displays can be assembled and viewed, and the way in 
which feedback can be incorporated from one display to another. Some also 
encourage flexibility by use of the visual programming paradigm, instead of the 
command language approach favoured by statistical software. One good example is 
the IRIS Explorer data visualisation and application building system, which is 
providing one of the basic components for STABLE. 

The other main source of material for STABLE is the Genstat statistical system 
(Payne et al 1993). This covers all the standard statistical techniques such as 
analysis of variance, regression and generalized linear models, curve fitting, 
multivariate and cluster analysis, estimation of variance components, analysis of 
unbalanced mixed models, tabulation and time series. It also offers many advanced 
techniques, including generalized additive models and the analysis of repeated 
measurements. Additional numerical and statistical components are being supplied 
from the NAG Fortran Library (NAG 1997). 

Figure 1: STABLE map for a principal coordinates analysis 

STABLE integrates the Genstat algorithms with the IRIS Explorer system to 
produce a "Statistical Explorer" system combining the statistical power and 
reliability of Genstat with the flexibility, power and extendability of IRIS Explorer. 
The visualisation facilities in IRIS Explorer will encourage the development of 
new visualisation methods in statistics. Its visual programming environment will 
enable non-programmers to develop their own statistical analyses in a user-friendly 
way. The ability of IRIS Explorer to define new interfaces will allow specific 
applications to be tailored to meet the needs of non-specialist end-users. IRIS 
Explorer also facilitates links to other systems such as databases, graphical 
information systems, symbolic manipulation packages, wordprocessing and 
desktop publishing systems, and to users' own source code. Furthermore it supports 
distributed heterogeneous operation, allowing graphical workstations, 
supercomputers and larger PCs to be used collectively in the solution of problems. 
Thus, it will provide a uniquely powerful environment, in particular, for research, 
development and promulgation of new ideas in multivariate graphics and display. 

2. Example 

Figure 1 illustrates visual programming by presenting the individual algorithmic 
components of the familiar Principal Coordinates Analysis. The right-hand side of 
the figure shows the map, or "visual programme", that does the analysis. The grey 
boxes, or modules, are the equivalent of commands in ordinary command-driven 
systems and the lines, or pipes, show the flow of data through the map. 

The window in the top left of Figure 1 shows the IRIS Explorer Librarian, which 
contains the available modules, which can be dragged and dropped across to the 
right-hand window to form the map. The statistical modules in STABLE include 
all the (Fortran) statistical algorithms in Genstat. The full Genstat system can also 
be run, as a module that executes a script of Genstat commands. This guarantees 
upwards compatibility and, in particular, allows users to continue to use their 
existing libraries of Genstat procedures. Genstat has been re-engineered so that the 
STABLE modules and Genstat both use the same DLLs, to ensure compatibility 
of results and efficiency of operation. Furthermore, these DLLs will also be 
available for use in other systems. Each module has two rectangular buttons that 
provide access to menus that allow data to be input to the module (the left button) 
or taken as output to another module (the right button). Once an argument has 
been selected, buttons on other modules that can supply data of the required type 
(for input to the module) or receive data of the generated type (for output from the 
module) are highlighted in green, to facilitate correct inter-connection. 

In Figure 1, the analysis starts with a ReadGenstat module which reads data 
from an ASCII file. The top right-hand button on a module opens the control 
panel (or menu) associated with the module. For ReadGenstat this specifies the 
file to be read and provides various options controlling how its contents are 
interpreted, while for FormSimilarity it selects the way in which the similarity 
matrix is calculated (Figure 2). STABLE supports a wide range of widgets to 
construct the menus, including list boxes, sliders, radio buttons, check boxes, 
dials etc., allowing custom menus to be formed appropriate to the particular 
end-user. Next, the PCO is calculated. First of all the similarity matrix 
is centred (CentreMat), then an eigenvalue decomposition is done (LatentRoots). 
From the LatentRoots module, the eigenvalues are taken for printing by a 
PrintGenstat module and also to the TransformData module for a square-root 
transformation to be applied. The latent vectors are taken to the MatMult module 
to be transposed and then right-multiplied by the square root of the eigenvalues 
to form the scores. The scores are printed by a PrintGenstat module, and plotted 
by a 3DScatterPlot module which produces an IRIS Explorer geometry structure 
to be displayed by the Render module in the bottom right corner of the map. 
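
In matrix terms, the chain of modules can be read roughly as the sequence 
below. This is a sketch that assumes the usual double-centring form of 
principal coordinates analysis applied to an n x n similarity matrix; the 
symbols S, B, V, Lambda and X are introduced here for illustration and do not 
appear in the map, and the exact orientation of the matrices handled by 
MatMult may differ. 

```latex
\begin{aligned}
B &= \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}'\right) S
     \left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}'\right)
     && \text{(CentreMat)} \\
B &= V \Lambda V', \qquad
     \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)
     && \text{(LatentRoots)} \\
X &= V \Lambda^{1/2}
     && \text{(TransformData, MatMult)}
\end{aligned}
```

The rows of X are the scores, and the first three columns would supply the 
coordinates for a three-dimensional scatter plot. 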

The display generated by Render, shown in the bottom left-hand side of Figure 1, 
is an interactive 3-D plot which can be manipulated by changing viewpoints, 
lighting, colouring and so forth. Information can be extracted from the plot using 
what is known as a pick operation, and used as input to other modules, thus 
allowing the user's interpretation of the plot to modify other aspects of the analysis 
and display. 

Figure 2: Menu for FormSimilarity 

The system thus allows sophisticated systems to be built, at run time, for the 
interactive visual analysis of multivariate data. For example, in a biplot analysis, 
the user could interactively select variables to be displayed or highlighted on the 
graph, or to be used in back-predictions (Gower et al. 1997). The system also 
facilitates intensive computation. The pipes can go across the internet, so that 
different modules can run on different machines. IRIS Explorer runs on a wide 
range of machines, including UNIX workstations, Crays and PCs running 
Windows NT. Once the map has been constructed, the details can be encapsulated 
in a grouped module. This appears like an ordinary module but in fact contains a 
map of individual modules (which may themselves be grouped modules). It is 
possible to specify which of the controls of the modules inside the group are made 
accessible to the outside user, allowing them appropriate control of the analysis 
while hiding unnecessary complications. Alternatively, the map can be stored to 
run again on a later occasion, by the developer or an end-user. 

3. Conclusion 

The STABLE project produced a first prototype at the end of 1997, and the whole 
project will be complete by the end of 1998. Also, during 1998, three diverse and 
challenging demonstration systems are being developed, for forecasting of time 
series, industrial quality improvement, and design and analysis of experiments. 

Further information is available about STABLE from the project coordinators 
NAG Ltd, Wilkinson House, Jordan Hill Road, Oxford OX2 8DR, UK or on the 
World Wide Web at: http://extweb.nag.co.uk/projects/STABLE.html 

Abstract: This paper describes recent results on factor analytic theory which 
involve the so-called Ledermann bound, and discusses theoretical properties of 
Minimum Rank Factor Analysis as an alternative to Iterative Principal Factor 
Analysis and exploratory Maximum Likelihood Factor Analysis. In terms of the 
residual eigenvalues, the three methods have closely related object functions, and 
will often give highly similar solutions in practice. Nevertheless, there are 
important differences between the methods. The most notable points are that 
Maximum Likelihood Factor Analysis is a method of fitting a statistical model, 
and that Minimum Rank Factor Analysis yields communalities in the classical 
sense, thus showing how much of the common variance is explained with any 
given number of factors. 

Key words: Factor analysis, proper solutions, Ledermann bound. 



1. Introduction 

Factor analysis is based on the notion that, given a set of standardized variables 
$z_1,\dots,z_m$, each variable $z_j$ can be decomposed into a common part $c_j$, 
giving rise to correlation between $z_j$ and $z_k$, and a unique part $u_j$, 
assumed to be uncorrelated with any variable except $z_j$, $j=1,\dots,m$. Upon 
writing 

$$z_j = c_j + u_j, \qquad (1)$$

$j=1,\dots,m$, and using the assumption on $u_j$, we have 

$$\Sigma = \Sigma_c + \Psi, \qquad (2)$$

where $\Sigma$ is the correlation matrix of the variables, $\Psi$ is the diagonal 
matrix of unique variances, and $\Sigma_c$ is the variance-covariance matrix of the 
common parts $c_j$, $j=1,\dots,m$, of the variables. The variances of these common 
parts are in the diagonal of $\Sigma_c$. They are the so-called communalities of 
the variables. 

The ideal of factor analysis is to find a decomposition (2) with $\Sigma_c$ of low 
rank $r$, which can be factored as $\Sigma_c = FF'$, with $F$ an $m \times r$ 
matrix. In practice, we have to settle for an approximation to the ideal 
situation, and decompose $\Sigma$, for some small value of $r$, as 

$$\Sigma = FF' + (\Sigma_c - FF') + \Psi. \qquad (3)$$

This means that the variances of the variables (diagonal elements of $\Sigma$) are 
decomposed into explained common variances (diagonal elements of $FF'$), 
unexplained common variances (diagonal elements of $\Sigma_c - FF'$), and unique 
variances (diagonal elements of $\Psi$). A proper solution for (3) requires that 
both $\Sigma_c$ and $\Psi$ are positive (semi)definite. Negative elements in 
$\Psi$, known as Heywood cases, have drawn a lot of attention. However, when 
$\Sigma_c$, the covariance matrix for the common parts of the variables, would 
appear to be indefinite, that would be no less embarrassing than having a negative 
unique variance in $\Psi$. The decomposition (3) is particularly relevant for 
"large sample methods" of factor analysis, where $\Sigma$ is estimated by the 
sample correlation matrix. Distinguishing between the population matrix $\Sigma$ 
and the sample correlation (or covariance) matrix $S$, say, will be deferred till 
Section 4. 

The unexplained common variance in (3) will be zero only if $r$ equals the rank of 
$\Sigma_c$. Accordingly, the (classical) question is to what extent the unique 
variances in $\Psi$, or, equivalently, the communalities in the diagonal of 
$\Sigma - \Psi$, can be chosen to reduce the rank of $\Sigma_c$. The answer has an 
intriguing history, revolving around Ledermann's bound, defined as 

$$\varphi(m) = \left[2m + 1 - \sqrt{8m+1}\right]/2. \qquad (4)$$

Ledermann (1937) proposed that the rank of $(\Sigma - \Psi)$ could always be 
reduced to this bound. In a classic paper on the history of factor analysis, 
Mulaik (1986, p. 25) did mention Ledermann's bound, but left its alleged merits 
unquestioned. Some of these merits have not been known until recently. The first 
purpose of the present paper is to familiarize the reader with these results. 

From electronic surveys of recent publications (e.g., see the PsychLITS 
1991-1997 of the American Psychological Association), it is clear that, in 
practice, a vast majority of researchers adopt Principal Component Analysis when 
Exploratory Factor Analysis could also be used, and among those who do use 
Exploratory Factor Analysis, Iterative Principal Factor Analysis (Harman & 
Jones, 1966) and Exploratory Maximum Likelihood Factor Analysis (Joreskog, 
1967) are about equally popular. The second purpose of this paper is to discuss 
relationships between these two methods of factor analysis, and compare them 
to a third method, Minimum Rank Factor Analysis (Ten Berge & Kiers, 1991). 
The latter method has interesting properties that seem to have gone unnoticed. A 
broader exposition of these properties, in comparison with the two familiar 
methods of factor analysis, seems warranted. Principal Components Analysis will 
not be considered here, because its purpose (condensation of variance) is not the 
purpose of factor analysis, see Widaman (1993). 



2. The Ledermann bound 

Ledermann (1937) compared numbers of equations and unknowns, and argued 
from this that the function $\varphi(m)$ of the number of variables, see (4), 
should be seen as an upper bound to $r$, the number of factors needed to solve 
(2). This optimistic view was shattered implicitly by counterexamples of Wilson 
and Worcester (1939), and explicitly by Guttman (1958). Guttman showed that the 
universal upper bound to $r$ is as high as $m-1$, also see Bekker & De Leeuw 
(1987). Ledermann's bound seemed history, albeit that the bound still plays a 
role in the number of degrees of freedom for the chi-square test of the factor 
analysis model: that number is positive if and only if the number of factors is 
below the bound (e.g., Joreskog, 1967, [40]). 
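
The link between the bound and the degrees of freedom can be checked 
numerically. The sketch below evaluates (4) and, as an assumption not printed 
in this paper, the customary degrees-of-freedom count for the exploratory 
factor model with m variables and r factors, namely [(m-r)^2 - (m+r)]/2. The 
function names are hypothetical. 

```python
from math import sqrt


def ledermann_bound(m: int) -> float:
    """Ledermann's bound phi(m) = [2m + 1 - sqrt(8m + 1)] / 2, see (4)."""
    return (2 * m + 1 - sqrt(8 * m + 1)) / 2


def degrees_of_freedom(m: int, r: int) -> float:
    """Assumed standard count for the factor model: [(m - r)^2 - (m + r)] / 2."""
    return ((m - r) ** 2 - (m + r)) / 2


for m in (3, 6, 10):
    print(m, ledermann_bound(m),
          [degrees_of_freedom(m, r) for r in range(m)])
```

For m = 6 the bound equals 3: with r = 2 factors the count is positive, with 
r = 3 it is exactly zero, and with r = 4 it is negative. 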

A revival of Ledermann's bound took place in the eighties. Shapiro (1982) 
showed that Ledermann's bound is almost surely a lower (rather than an upper) 
bound to the number of factors needed in factor analysis. Shapiro (1982) also 
proved that the minimum rank is unstable below the Ledermann bound. This 
means that, when the rank of $\Sigma$ can be reduced to a value below the 
Ledermann bound, any small change of any off-diagonal element of $\Sigma$ will 
increase the minimum rank of $\Sigma$. 

Furthermore, Shapiro (1985) has shown that a $\Psi$ that solves (2) properly, which
means that $\Sigma-\Psi$ is a positive semidefinite matrix of rank r, is almost surely
non-unique (meaning that other proper choices of $\Psi$ exist) when r is above the
Ledermann bound, and almost surely locally unique at and below the Ledermann
bound. Local uniqueness means that in a neighbourhood of any $\Psi$ that satisfies
(2) no other solutions exist. It is necessary but not sufficient for global
uniqueness. Shapiro (1985) also conjectured, as a generalization of a result by
Anderson and Rubin (1956, Theorem 5.1), also see Kano (1990, p. 281), that $\Psi$
is almost surely globally unique when r is below the Ledermann bound. This
conjecture has been proven correct (Bekker & Ten Berge, 1997).

The results on Ledermann's bound are important in that they provide answers to
certain long-standing questions. However, from a practical point of view, the
results are not what one might have hoped. Uniqueness of $\Psi$ usually holds for
cases of perfect fit, as defined by (2), below the Ledermann bound, but such
cases do not arise in practice. We shall have to resort to solutions for (3), for
which the uniqueness results mentioned above are not valid. A potentially useful
method for obtaining a meaningful solution to (3) is MRFA.

3. Minimum Rank Factor Analysis

MRFA (Ten Berge & Kiers, 1991) is an iterative method which seeks those
unique variances (diagonal elements of $\Psi$) that minimize the sum of the m-r
smallest eigenvalues of $\Sigma-\Psi$, written as

$$f_1(\Psi) = \sum_{j=r+1}^{m} \lambda_j, \tag{5}$$

subject to the constraint that $\Sigma-\Psi$ has nonnegative eigenvalues
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m \ge 0$
and that $\Psi$ has nonnegative diagonal elements. The smallest value of r for which
$f_1$ attains the minimum of zero is the minimum rank of $\Sigma$, see Ten Berge & Kiers
(1991). Incidentally, when the minimum rank of $\Sigma$ is less than m/2, the
solution for $\Psi$ is almost surely unique (because r is below the Ledermann
bound), but, more importantly, it can be obtained in closed form, see Albert
(1944) and Ihara & Kano (1986).

MRFA can be used to solve (2) subject to its constraints when the number of
factors r is allowed to be large enough, but such numbers of factors would be too
large to be of practical interest. For smaller values of r, MRFA gives a rank r fit
to the common parts of the variables, see (1). This property can easily be derived
from the well-known Eckart-Young theorem. Due to the nonnegativity constraints,
MRFA is protected against improper solutions: The resulting $\Sigma_c$ is positive
semidefinite and $\Psi$ is typically, but not always, positive definite (it may be
singular).

The practical interpretation of MRFA is that it offers a proper solution to (3) by
minimizing the common variance that is left unexplained when as few as r
factors are used, see (5). This is precisely what factor analysis is about.
However, MRFA does not just minimize the unexplained common variance: It
also reveals how small it is. This is possible because MRFA preserves the
distinction between communalities and explained common variances: The former
are the diagonal elements of $\Sigma_c=\Sigma-\Psi$, and the latter are the row sums of squares
of F. In terms of eigenvalues of $\Sigma_c$, the total common variance is
$\lambda_1+\dots+\lambda_m$, and
the unexplained part of this is given by (5), whence the percentage of explained
common variance is $100\times(\lambda_1+\dots+\lambda_r)/(\lambda_1+\dots+\lambda_m)$.
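
As a minimal sketch of this fit measure, the function below takes a covariance matrix and a set of unique variances (both hypothetical here, not an MRFA solution) and returns the percentage of explained common variance together with the unexplained part, the sum of the m-r smallest eigenvalues. It does not perform the MRFA minimization itself; it only evaluates the measure for a given proper choice of unique variances.

```python
import numpy as np


def explained_common_variance(sigma, psi_diag, r):
    """Return (percentage explained, unexplained common variance) for r factors.

    sigma    : (m, m) covariance or correlation matrix
    psi_diag : length-m vector of unique variances
    """
    sigma_c = sigma - np.diag(psi_diag)                 # covariance of the common parts
    eig = np.sort(np.linalg.eigvalsh(sigma_c))[::-1]    # eigenvalues, descending
    if eig[-1] < -1e-8:
        raise ValueError("Sigma - Psi is not positive semidefinite (improper solution)")
    total = float(eig.sum())                            # total common variance
    unexplained = float(eig[r:].sum())                  # sum of the m - r smallest eigenvalues
    return 100.0 * (total - unexplained) / total, unexplained


# Hypothetical example: common parts with eigenvalues 3, 1.5, 0.4 and 0.1.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))            # random orthogonal basis
sigma_c = q @ np.diag([3.0, 1.5, 0.4, 0.1]) @ q.T
psi_diag = np.array([0.5, 0.6, 0.7, 0.8])
sigma = sigma_c + np.diag(psi_diag)

pct, unexplained = explained_common_variance(sigma, psi_diag, r=2)
print(round(pct, 2), round(unexplained, 2))
```

With two factors, the two largest eigenvalues account for 4.5 of the total common variance of 5, so the measure is 90 percent.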

To see how convenient MRFA can be, consider the analysis by Iterative
Principal Factor Analysis (to be discussed below) of nine intelligence tests by
Carroll (1993, Table 3.2). Carroll (1993, pp. 96-98) reports that a four factor
solution is sufficient. When MRFA is applied with r=3 and r=4, we find
apparently unique solutions for $\Psi$, with percentages of explained common
variance of 93.16 and 99.10, respectively. This is very clear evidence that four
factors are indeed enough. In terms of the resulting factor solutions, the MRFA
loadings for r=4 are very similar to those reported by Carroll. The essential
difference is that MRFA yields, as a bonus, the percentage of explained common
variance associated with any number of factors.

The iterative algorithm for MRFA is started from m+1 different rational starts
for $\Psi$. In practice, the algorithm has never failed to find the global minimum
when that is known to be zero (Ten Berge & Kiers, 1991). A study on the
sample size needed to infer population factors from MRFA factors in a sample
is still in progress. Results so far indicate that samples of n=200 are more than
enough.



4. Iterative principal factor and maximum likelihood factor
analysis

The object function of MRFA is similar to that of Iterative Principal Factor
Analysis (IPFA), first suggested by Thomson (1934), as implemented by Harman
& Jones (1966). This method, also known as the Unweighted Least Squares
option of LISREL (Jöreskog & Sörbom, 1996), is aimed at minimizing, for any
given r, the function

$$f_2(F,\Psi) = \mathrm{SS}\{\Sigma - FF' - \Psi\}, \tag{6}$$

where SS means sum of squares, F is an m×r matrix of factor loadings, and $\Psi$ a
diagonal matrix of unique variances. Harman and Jones (1966) suggested an
iterative approach, updating F on the basis of the first r eigenvectors and
eigenvalues of $\Sigma-\Psi$, and updating $\Psi$ as Diag[$\Sigma-FF'$]. The minimum obtained is
also the minimum of the sum of squares of the last m-r eigenvalues of $\Sigma-\Psi$. This
means that the covariance matrix $\Sigma_c$ of the common parts of the variables is
made as close as possible to a rank r matrix, rather than the matrix of these
common parts themselves, as in MRFA. Still, the two minimization criteria of
IPFA and MRFA are quite similar. The most important difference is that MRFA
minimizes the "size" of the smallest eigenvalues of $\Sigma-\Psi$ subject to nonnegativity
constraints, whereas IPFA minimizes their size without a constraint. Harman
(1967, p. 195) has pointed to the important property that the last m-r
eigenvalues in IPFA sum to zero, which means that they are either zero (perfect
fit) or at least one of them is negative. It follows that IPFA yields an improper
solution for (3) in every case of less than perfect fit.
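
The alternating scheme described above can be sketched as follows. This is a minimal illustration under simplifying assumptions (a fixed number of iterations, a crude start with half the diagonal as unique variances, and negative eigenvalues clipped at zero when forming F), not the Harman and Jones implementation; the data are simulated for the purpose of the example.

```python
import numpy as np


def ipfa(sigma, r, n_iter=500):
    """Alternate between the two least squares updates of IPFA.

    Returns the loadings F, the unique variances psi, the loss after each
    iteration, and the m - r smallest eigenvalues of Sigma - Psi at the end.
    """
    m = sigma.shape[0]
    psi = 0.5 * np.diag(sigma)                      # crude starting values (assumption)
    losses = []
    for _ in range(n_iter):
        vals, vecs = np.linalg.eigh(sigma - np.diag(psi))
        top = np.argsort(vals)[::-1][:r]            # indices of the r largest eigenvalues
        lam = np.clip(vals[top], 0.0, None)         # guard against negative eigenvalues
        F = vecs[:, top] * np.sqrt(lam)             # update F given Psi
        psi = np.diag(sigma - F @ F.T).copy()       # update Psi as Diag[Sigma - FF']
        resid = sigma - F @ F.T - np.diag(psi)
        losses.append(float(np.sum(resid ** 2)))    # sum of squared residuals
    tail = np.sort(np.linalg.eigvalsh(sigma - np.diag(psi)))[: m - r]
    return F, psi, losses, tail


# Simulated correlation matrix of six variables with two underlying factors.
rng = np.random.default_rng(1)
loadings = rng.uniform(0.3, 0.8, size=(6, 2))
scores = rng.normal(size=(300, 2))
data = scores @ loadings.T + rng.normal(scale=0.6, size=(300, 6))
sigma = np.corrcoef(data, rowvar=False)

F, psi, losses, tail = ipfa(sigma, r=2)
print(losses[0], losses[-1], tail.sum())
```

Each of the two updates is a least squares minimizer given the other block, so the loss cannot increase from one iteration to the next. Printing `tail.sum()` allows one to inspect how close the last m-r eigenvalues come to summing to zero after the chosen number of iterations.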

As a computational alternative to IPFA, Harman and Jones (1966) have also
proposed the MINRES method. This method is aimed at minimizing (6),
expressed as a function of F alone, viz.

$$f_3(F) = \mathrm{SS}\{\Sigma - \mathrm{diag}(\Sigma) - FF' + \mathrm{diag}(FF')\}. \tag{7}$$

To minimize $f_3$, they suggested updating the rows of F iteratively by multiple
regression. When differences between IPFA and MINRES solutions arise, this is
due to early termination of the iterative process, or to differences in the
suppression of Heywood cases. It is easy to adapt MINRES so as to prevent
Heywood cases, but in IPFA it is not at all obvious how to do this without
disturbing the monotonic convergence of the algorithm.
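
The MINRES criterion is the sum of squared off-diagonal residuals, which is what remains of the IPFA criterion once the unique variances are set to their optimal values for a given F. The short check below illustrates this with hypothetical loadings and a hypothetical correlation matrix; the function names are chosen for this example only.

```python
import numpy as np


def f2(sigma, F, psi_diag):
    """IPFA criterion: sum of squares of Sigma - FF' - Psi."""
    resid = sigma - F @ F.T - np.diag(psi_diag)
    return float(np.sum(resid ** 2))


def f3(sigma, F):
    """MINRES criterion: sum of squares of the off-diagonal part of Sigma - FF'."""
    resid = sigma - F @ F.T
    off_diagonal = resid - np.diag(np.diag(resid))
    return float(np.sum(off_diagonal ** 2))


# Hypothetical loadings and a slightly perturbed correlation matrix.
rng = np.random.default_rng(2)
F = rng.uniform(0.2, 0.8, size=(5, 2))
noise = rng.normal(scale=0.05, size=(5, 5))
sigma = F @ F.T + (noise + noise.T) / 2
np.fill_diagonal(sigma, 1.0)

psi_opt = np.diag(sigma - F @ F.T)     # optimal unique variances for this F
psi_other = psi_opt + 0.1              # any other choice can only increase f2

print(f3(sigma, F), f2(sigma, F, psi_opt), f2(sigma, F, psi_other))
```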

So far we have not distinguished between the correlation matrix in a sample and
in the population. The reason is that both MRFA and IPFA are "large sample"
methods, which require that the sample correlation matrix S provides an
accurate estimate of the population correlation matrix $\Sigma$. When turning to
MLFA, the distinction between S and $\Sigma$ becomes essential.

The MLFA method (e.g., Jöreskog, 1967) determines the population parameters
such that they maximize the likelihood function of the data, assuming
multivariate normality. Interestingly, MLFA also has an interpretation in terms
of the smallest eigenvalues of a certain matrix. Specifically, MLFA minimizes the
variation around the value one of the smallest eigenvalues of
$\Psi^{-1/2} S \Psi^{-1/2}$, where
S is the sample correlation matrix, in an "approximate least squares sense", as
explained by Jöreskog (1967, p. 449). Roughly (when the approximation is
close), this means that the sum of squares of the smallest eigenvalues of
$\Psi^{-1/2}(S-\Psi)\Psi^{-1/2}$
is minimized. This is very similar to the criterion of IPFA, and coincides
with both IPFA and MRFA in case of perfect fit. Of course, the eigenvalue
criterion for MLFA is not what MLFA is aimed at. However, it does explain the
striking similarity that is often encountered between loadings obtained from
MLFA and those obtained from IPFA.

Whereas MRFA and IPFA use the sample correlation matrix S as a direct
estimate of the population correlation matrix $\Sigma$, MLFA estimates $\Sigma$ indirectly as
$FF'+\Psi$, with F and $\Psi$ based on fitting the factor analysis model by the
ML-procedure. When oblique factors are allowed, MLFA estimates $\Sigma$ as $F\Phi F'+\Psi$,
where $\Phi$ is the matrix of correlations between the factors. This may yield an
indefinite $\Phi$, in violation of (2). To avoid this kind of anomaly, a constrained
method of MLFA, yielding $\Phi$ positive definite, has been proposed by Bentler &
Jamshidian (1994).

In terms of decomposition of $\Sigma$, MLFA cannot yield an improper solution when
$\Psi$ is positive semidefinite, because the estimate of $\Sigma$ satisfies (2) by
construction. MLFA provides a test for the "ideal situation" represented in (2).
It is not aimed at a decomposition as given in (3), which represents the situation
of imperfect fit, to be faced in practice. An important implication of this will be
discussed in the next section.



5. In search for proper communalities 

In the classical view, factor analysis was conceived of as a two-step procedure.
Assuming that $\Sigma$ is well-estimated from S, unique variances are determined
which reduce the rank of $\Sigma-\Psi$ as much as possible; next, $\Sigma-\Psi$ is to be factored
as FF'+residual. In modern approaches, the loading matrix is typically
determined by iterative methods, and the communalities are obtained as
by-products afterwards: Computer package implementations for IPFA and MLFA
(using orthogonal factors) yield communalities as row sums of squares of the
loading matrix F.

In case of perfect fit, there is nothing disturbing about this procedure. However,
when the fit is less than perfect, neither IPFA nor MLFA yields proper
communalities: Row sums of squares of F are explained common variances, to
be distinguished from common variances to be explained. When the distinction is
ignored, 100 % of the common variance would be explained by the r common
factors, which would be incompatible with non-perfect fit. Because they fail to
yield communalities that satisfy (3), it must be concluded that IPFA and MLFA
do not indicate the percentage of common variance explained by the factors.
Although MRFA is fully protected against improper solutions, and although
the method restores the concept of communality in the traditional sense, the
communalities of MRFA are not necessarily equal to the "true" communalities.
By definition, communalities are a property of the set of variables under study,
independent of the factors used to decompose the common variance. With
MRFA they are not, as will be demonstrated in the next section: The
communalities of MRFA do depend on the number of factors. Hence, MRFA
cannot be claimed to reproduce the true communalities as long as the "true rank"
is unknown. This embarrassing problem, however, seems to be inherent to any
method of factor analysis.

Compared to MRFA, IPFA has a computational advantage. MRFA is quite
laborious (it may take several hours on a 486PC) when the number of variables
exceeds 20. For IPFA, this is not nearly as much of a problem. Compared to
MLFA, MRFA lacks the statistical footing of estimating population parameters,
and it lacks the flexibility in adjusting hypothesized loadings or correlations
between factors: Confirmatory and exploratory MLFA can be carried out in a
unified ML-framework, with exploratory MLFA as the special case where no
restrictions are imposed other than those needed for identification of the
loadings. On the other hand, MRFA is useful in evaluating, rather than testing,
hypotheses about the number of factors needed for perfect fit: The percentage of
unexplained common variance is a natural measure of departure from this
hypothesis. Nowadays, MLFA is typically accompanied by measures of fit
(Browne & Cudeck, 1992; Cudeck & Henly, 1991; Hu, Bentler & Kano, 1992),
a number of which reflect the deviation of S from $FF'+\Psi$. None of these,
however, seems to be a direct measure of the percentage or proportion of
explained common variance.

In deciding between MRFA and MLFA for Exploratory Factor Analysis, it is
tempting to choose the best looking method according to one's taste. However,
this attitude is incompatible with the view that applied statistics is an empirical
science. One should simply adopt the method that works best, e.g., in Monte
Carlo studies, where the truth is known by construction. MRFA and its fit
measure, the percentage of explained common variance, will have to prove their
potential merits in comparative simulation studies. Apart from that, MRFA may
be helpful in theoretical work on factor analysis (e.g., Heywood cases), because
it is the first method of factor analysis that yields a proper solution for (3) under
a reasonable optimality criterion.



6. Lord’s test data 

A reanalysis was conducted of 15 psychological tests, selected by Jöreskog
(1971, pp. 119-124) from a larger battery of Lord (1956), administered to a
sample of 649 subjects. The tests are of three kinds: Vocabulary (1-5),
Intersections (6-10) and Arithmetic Reasoning (11-15). For each kind of test,
there are two speeded versions (1, 2, 6, 7, 11, and 12) and three level versions.
Jöreskog applied confirmatory MLFA to these variables. From the high value for
chi-square with the three factor hypothesis he inferred that more factors were
needed. In fact, he adopted a five or six factor solution.

On the same data, MRFA was carried out, with r=3, r=4 and r=5. In each case,
16 independently started runs converged to a unique set of communalities. The
percentages of explained common variance were 91.50, 94.67, and 96.52,
respectively. The Varimax rotated loadings and communalities are reported in
the left hand part (for r=3) and the right hand part (for r=4) of Table 1. Clearly,
the first three (trait) factors for r=3 and r=4 are essentially the same. The
additional factor IV when r=4 is a speed-level contrast factor: This factor has all
correlations with speed tests positive and all correlations with level tests
negative. It explains hardly any variance (2.4 % of the common variance) at all.
There is no reason to adopt a fifth factor, let alone a sixth. Considering that
factor analysis should strike a compromise between parsimony (small r) and
meaningfulness, the MRFA results are quite satisfactory.

When IPFA is applied to the same data, the resulting loading matrices for r=3
and r=4, respectively, closely resemble those of Table 1. The key difference,
however, is that IPFA does not show proper communalities, which means that
(3) is not solved, and evaluating the percentage of explained common variance is
not possible.

When exploratory MLFA (EQS: Bentler, 1995; LISREL: Jöreskog & Sörbom,
1996) is applied with r=3 and r=4, respectively, setting the upper triangular
elements of F to zero for identification of the loadings, we find essentially the
same loadings as with MRFA. However, the two models do not fit (p<.01), and
the chi-square test would lead us to reject a perfectly satisfactory solution. Also,
MLFA does not allow an evaluation of the percentage of explained common
variance.

As was mentioned above, the MRFA communalities for r=3 and for r=4
appeared to be unique. In fact, multiple solutions were not encountered when r
was stepped up to 5, 6, ..., 10 (values above 10, the Ledermann bound when
m=15, were not considered). This means that uniqueness below and even at the
Ledermann bound, although not granted, appears to hold here. Nonetheless,
Table 1 shows that the communalities do change when we go from r=3 to r=4
factors. This is the embarrassing problem of MRFA, alluded to above.



Table 1
MRFA loadings and communalities (Comm.) for Lord's data (r=3, left; r=4, right)

                  r=3                              r=4
Test     I      II     III    Comm.      I      II     III     IV     Comm.
  1    0.035  0.764  0.164  0.172     0.030  0.760  0.168   0.208  0.707
  2    0.092  0.784  0.169  0.764     0.085  0.784  0.171   0.288  0.780
  3    0.094  0.802  0.218  0.742     0.098  0.802  0.219  -0.090  0.739
  4    0.019  0.911  0.202  0.933     0.024  0.912  0.204  -0.114  0.933
  5    0.011  0.857  0.238  0.860     0.018  0.859  0.240  -0.193  0.859
  6    0.802  0.069  0.214  0.813     0.800  0.065  0.212   0.231  0.827
  7    0.833  0.023  0.163  0.779     0.829  0.021  0.162   0.075  0.769
  8    0.848  0.086  0.188  0.800     0.849  0.085  0.168  -0.036  0.796
  9    0.891  0.053  0.165  0.878     0.893  0.052  0.163  -0.074  0.874
 10    0.868  0.029  0.237  0.850     0.872  0.029  0.234  -0.107  0.852
 11    0.221  0.183  0.649  0.580     0.216  0.178  0.652   0.203  0.588
 12    0.206  0.167  0.671  0.593     0.204  0.163  0.669   0.178  0.585
 13    0.172  0.201  0.766  0.703     0.178  0.201  0.764  -0.125  0.698
 14    0.217  0.199  0.728  0.657     0.221  0.197  0.728  -0.054  0.657
 15    0.129  0.232  0.729  0.634     0.135  0.232  0.730  -0.099  0.635
 SS    3.806  3.618  2.913            3.813  3.610  2.911   0.363

Abstract: This paper introduces two forms of informational complexity ICOMP
criteria of Bozdogan (1988, 1990, 1994) as a decision rule for model selection and
evaluation in the Bayesian Confirmatory Factor Analysis (BAYCFA) model due to
Press and Shigemasu (1989) in contemporaneously choosing the number of factors
and determining the "best" approximating factor pattern structure. A Monte Carlo
simulation example with a known factor pattern structure and known actual
number of factors is shown to demonstrate the utility and versatility of the new
approach in recovering the true structure.

Key words: Bayesian Confirmatory Factor Analysis; Choosing the Number of 
Factors; and Informational Complexity. 

1. Introduction 

The main purpose of factor analysis, as an important multivariate statistical
technique, is to determine whether or not the correlations among a large number of
observed variables can be explained in terms of a relatively small number of
underlying factors, and how many factors best fit the data (the exploratory stage).
After choosing the best fitting number of factors, the question then becomes how
the extracted factors depend upon the original observed variables, and how we
incorporate the prior information in the factor model (the confirmatory stage).

The standard confirmatory factor analysis (CFA) model currently practiced
does not incorporate the prior distributions of the parameters and the postulated
prior information on the specific factor pattern structures within the model. In
CFA, the prior information is fixed and it is used statically. Since information
changes over time and prior information is often random in nature, a new
approach is needed to estimate the number of factors and to determine the "best"
approximating factor pattern structure contemporaneously. This will help us to
capture the expert judgement of the researchers to validate their substantive
models, as the prior information might change over time and from expert to expert.

The purpose of this paper is to implement various information theoretic
procedures as new post-data analytic procedures, that is, decision making in the
light of the data, in the BAYCFA model which was developed by Press and Shigemasu
(1989), where the number of factors and the postulated factor pattern structure
were chosen arbitrarily. In this paper, two forms of informational complexity
(ICOMP) criteria of Bozdogan (1988, 1990, 1994) are derived and estimated
under the oblique Bayesian factor model using a generalized natural conjugate
family of prior distributions. For comparative purposes the classic Akaike's (1973)
information criterion (AIC) and Rissanen's (1978) minimum description length
(MDL) criteria are also developed. To eliminate the subjective specification of the
hyperparameters, in this paper, we estimate the imperfectly known
hyperparameters of the model using a clustering method. This will be based on the sparse
root algorithm of Hartigan (1975) to obtain the "best" approximating factor pattern
structure to be used in determining the initial prior hyperparameters for the
BAYCFA model. Such an approach provides a valuable mechanism to achieve
flexibility in BAYCFA modeling.

A simulation example with a known factor pattern structure and known actual 
number of factors is shown to demonstrate the utility and versatility of the new 
approach in recovering the true structure. 



2. Bayesian Factor Analysis (BAYFAC) Model 

Let x (p×1) be an observable random vector with mean vector $\mu$ and covariance
matrix $\Sigma$. We write the usual factor model in the matrix form given by

$$x_j = \Lambda f_j + \varepsilon_j, \quad m < p, \tag{1}$$

where $\Lambda$ is a (p×m) matrix called the factor loading matrix; $f_j$ is an (m×1) vector
which represents the common factors or factor scores for individual j,
$F' = (f_1, \dots, f_n)$; and $\varepsilon_j$ is a (p×1) vector of random errors or disturbances
whose variances $\Psi$ represent the "uniqueness" of $x_j$.

The basic assumptions for the model in (1) are: 

$$f_j \sim N_m(0, I_m); \quad \text{and} \quad \varepsilon_j \sim N_p(0, \Psi), \tag{2}$$

where $\Psi$ is the symmetric positive definite (p.d.) covariance matrix of the random
errors, which in our case, unlike in the usual factor analysis model, is not a
diagonal matrix; and $f_j$ and $\varepsilon_j$ are independent.

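We assume that $(\Lambda, F, \Psi)$ are unobserved and fixed parameters. Further,
we write the probability distribution of the model in (1) equivalently as

$$(x_j \mid \Lambda, f_j, \Psi) \sim N_p(\Lambda f_j, \Psi). \tag{3}$$

For a random sample of n observations, $X' = (x_1, \dots, x_n)$, the likelihood function is given by

$$L(\Lambda, F, \Psi \mid X) \propto |\Psi|^{-n/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\,\Psi^{-1}(X - F\Lambda')'(X - F\Lambda')\right\}, \tag{4}$$

where "$\propto$" denotes proportionality, with a constant that depends on (p, n)
and not on $(\Lambda, F, \Psi)$.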

Following Press (1982) and Press and Shigemasu (1989), we use a generalized
natural conjugate family of prior distributions for $(\Lambda, \Psi)$. We take as prior density
for the parameters:

$$p(\Lambda, F, \Psi) = p(\Lambda \mid \Psi)\, p(\Psi)\, p(F). \tag{5}$$

We adopt the normal prior density

$$p(\Lambda \mid \Psi) \propto |\Psi|^{-m/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\,(\Lambda - \Lambda_0) H (\Lambda - \Lambda_0)' \Psi^{-1}\right\}, \tag{6}$$

where $\Lambda_0$ is a (p×m) prior factor loading matrix, and H is an (m×m) prior
inter-factor correlation matrix. Both of these matrices are hyperparameters that
need to be estimated in practice.

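Next, we assume that the prior distribution for $\Psi$ is

$$p(\Psi) = c_0(\nu, p)\, |B|^{(\nu-p-1)/2}\, |\Psi|^{-\nu/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\,\Psi^{-1}B\right\}, \tag{7}$$

an inverted Wishart distribution, which is denoted by $W^{-1}(B, p, \nu)$. In (7), B is a
(p×p) scale matrix of hyperparameters, and $c_0(\nu, p)$ is a constant depending only
upon $(\nu, p)$ and not on the unknown number of factors in the model, given by

$$c_0^{-1}(\nu, p) = 2^{(\nu-p-1)p/2}\, \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left(\frac{\nu-p-j}{2}\right), \tag{8}$$

with $\nu$ degrees of freedom, which is also a hyperparameter.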

If we consider a vague (or uninformative) prior density for the factor scores
F, i.e., if $p(F) \propto$ constant, then $F \mid X$ follows a matrix T-distribution. If we
suppose $f_j \sim N(0, I)$, and the $f_j$'s are mutually independent, then

$$p(F) \propto \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\,F'F\right\} \quad \text{(spherical prior density)}. \tag{9}$$

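Now, using Bayes' theorem, we obtain the joint posterior density of the
parameters by combining equations (4), (6), (7), and (9). This is given by

$$p(\Lambda, F, \Psi \mid X) \propto L(\Lambda, F, \Psi \mid X)\, p(\Lambda \mid \Psi)\, p(\Psi)\, p(F) \propto p(F)\, |\Psi|^{-(n+m+\nu)/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\,\Psi^{-1}G\right\}, \tag{10}$$

where

$$G = (X - F\Lambda')'(X - F\Lambda') + (\Lambda - \Lambda_0) H (\Lambda - \Lambda_0)' + B. \tag{11}$$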

As shown in Press and Shigemasu (1989), the respective marginal posterior
distributions are obtained from (10) by integrating out the nuisance parameters
sequentially. For more details, we refer the readers to Press and Shigemasu (1989).
For brevity, it will suffice here to give the necessary Bayes estimates of the factor
scores, the factor loading matrix, and the disturbance covariance matrix, respectively,
which we need to develop the ICOMP, AIC, and MDL criteria. These Bayes estimates
of the unknown parameters are as follows.

• First, the Bayes estimate of the factor scores is given by

$$\hat F = (I_n - X W^{-1} X')^{-1} X W^{-1} \Lambda_0 H = [I_n - X (X'X - W)^{-1} X']\, X W^{-1} \Lambda_0 H, \tag{12}$$

where

$$W = X'X + B + \Lambda_0 H \Lambda_0'. \tag{13}$$

• The Bayes estimate of the factor loading matrix $\Lambda$, conditional on $\hat F$ and X,
is given by

$$\hat\Lambda = (X'\hat F + \Lambda_0 H)(H + \hat F'\hat F)^{-1}. \tag{14}$$

• The Bayes estimate of the disturbance matrix $\Psi$ is given by

$$\hat\Psi = \hat G/(n + m + \nu - 2p - 2),$$

where

$$\hat G = (X - \hat F\hat\Lambda')'(X - \hat F\hat\Lambda') + (\hat\Lambda - \Lambda_0) H (\hat\Lambda - \Lambda_0)' + B, \tag{15}$$

and where $\nu$ is the degrees of freedom of the distribution of $\Psi$. Usually we do not
have substantial knowledge about $\Psi$, and so we probably want to make $\nu$ as
small as possible. Rigorously speaking, one may want to change $\nu$ depending
on the number of factors extracted, because our knowledge about $\Psi$ may be
different when we presuppose a different number of factors. However, in
applications, we suppose this change is negligible. Certainly, it will be of interest
to study the sensitivity of the number of factors as a function of $\nu$. The results of
this will be reported elsewhere.
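
The three Bayes estimates can be computed in sequence, as the sketch below illustrates. It assumes the reading of the estimates in which the factor score estimate is (I - X W^{-1} X')^{-1} X W^{-1} Λ0 H with W = X'X + B + Λ0 H Λ0', and in which the disturbance estimate divides G by n + m + ν - 2p - 2. The data and all hyperparameter values (Lambda0, H, B, nu) are hypothetical and chosen only for illustration; this is not the BAYFAC software.

```python
import numpy as np


def bayes_estimates(X, Lambda0, H, B, nu):
    """Bayes estimates of factor scores, loadings and disturbance covariance.

    X: (n, p) data, Lambda0: (p, m) prior loadings, H: (m, m) prior
    inter-factor correlations, B: (p, p) scale matrix, nu: degrees of freedom.
    """
    n, p = X.shape
    m = Lambda0.shape[1]
    W = X.T @ X + B + Lambda0 @ H @ Lambda0.T
    XWinvL0H = X @ np.linalg.solve(W, Lambda0 @ H)          # X W^{-1} Lambda0 H
    A = np.eye(n) - X @ np.linalg.solve(W, X.T)             # I_n - X W^{-1} X'
    F_hat = np.linalg.solve(A, XWinvL0H)                    # factor scores
    Lambda_hat = (X.T @ F_hat + Lambda0 @ H) @ np.linalg.inv(H + F_hat.T @ F_hat)
    R = X - F_hat @ Lambda_hat.T                            # residuals
    D = Lambda_hat - Lambda0                                # departure from prior loadings
    G = R.T @ R + D @ H @ D.T + B
    Psi_hat = G / (n + m + nu - 2 * p - 2)                  # disturbance covariance
    return F_hat, Lambda_hat, Psi_hat, W


# Hypothetical data and hyperparameters.
rng = np.random.default_rng(3)
n, p, m = 50, 6, 2
Lambda0 = np.zeros((p, m))
Lambda0[:3, 0] = 0.7
Lambda0[3:, 1] = 0.7
H = np.array([[1.0, 0.2], [0.2, 1.0]])
B = 0.2 * np.eye(p)
nu = 15
X = rng.normal(size=(n, m)) @ Lambda0.T + rng.normal(scale=0.5, size=(n, p))

F_hat, Lambda_hat, Psi_hat, W = bayes_estimates(X, Lambda0, H, B, nu)

# Equivalent form of the factor score estimate, using W - X'X = B + Lambda0 H Lambda0'.
F_alt = (np.eye(n) + X @ np.linalg.solve(B + Lambda0 @ H @ Lambda0.T, X.T)) \
    @ X @ np.linalg.solve(W, Lambda0 @ H)
print(np.allclose(F_hat, F_alt))
```

The final check compares the two algebraically equivalent expressions for the factor scores; the second avoids inverting an n×n matrix that can be poorly conditioned when B is small.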



3. ICOMP, AIC, and MDL Criteria for the BAYFAC Model 

In this section, based on the Bayes estimates obtained in Section 2 above, we
construct the covariance structure for the m-th factor model given by

• $M_{of}$: $\hat\Sigma_m = \hat\Lambda_m \hat\Lambda_m' + \hat\Psi_m$ (the orthogonal factor model), and

• $M_{obf}$: $\hat\Sigma_m = \hat\Lambda_m \hat H_m \hat\Lambda_m' + \hat\Psi_m$ (the oblique factor model),

where m=1, 2, ..., M is the number of factors, and $m \le M \le \tfrac{1}{2}[2p + 1 - \sqrt{8p + 1}]$.
We call this bound big-factor. It is the largest number of factors we can allow and
still have the number of free parameters $s = (mp + p) - \tfrac{1}{2}m(m-1)$ in the factor model
be less than the number of parameters in the covariance matrix. Now using these
covariance structures and treating the factor model as a Random Factor Model
(RFM), we give the two closed form expressions of the ICOMP criteria of Bozdogan
(1988, 1990, 1994), Akaike's (1973) AIC, and Rissanen's (1978) MDL criteria.
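
The bound can be recovered from the parameter count by a short calculation, sketched here under the assumption that the covariance matrix of p variables has p(p+1)/2 distinct elements:

```latex
\frac{p(p+1)}{2} - \Bigl[(mp + p) - \tfrac{1}{2}m(m-1)\Bigr]
  = \tfrac{1}{2}\bigl[(p-m)^2 - (p+m)\bigr] \;\ge\; 0
\quad\Longleftrightarrow\quad
m \;\le\; \tfrac{1}{2}\bigl[\,2p + 1 - \sqrt{8p+1}\,\bigr].
```

For example, with p=9 the right-hand side is about 5.23, so at most five factors are allowed: five factors give 45 + 9 - 10 = 44 free parameters against 45 distinct covariance elements, whereas six factors would give 48.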

The first approach of ICOMP uses the covariance matrix properties of a model's
parameter estimates starting from their finite sampling distributions. In this case,
we have

$$\mathrm{ICOMP}_{obf}(\mathrm{BayFac}) = np\log(2\pi) + n\bigl[\log|\hat\Sigma_m| + \mathrm{tr}\,\hat\Sigma_m^{-1}S\bigr] + 2\bigl[(m+1)\,C_1(\hat\Psi_m) + p\,C_1(\hat H_m^{-1})\bigr], \tag{16}$$

where $C_1(Q) = \tfrac{q}{2}\log[\mathrm{tr}(Q)/q] - \tfrac{1}{2}\log|Q|$, with q the dimension of Q, is the maximal entropic measure of
complexity of a (p.d.) covariance matrix Q of van Emden (1971) which was first
introduced by Bozdogan (1988) in multivariate linear and/or nonlinear parametric
modeling as a new penalty functional. Instead of penalizing a model based only on
the number of parameters in the model, as in AIC, ICOMP penalizes based upon
the covariance structure of the components of the model. In this manner, ICOMP
incorporates the interactions among the components of the model and measures the
inequalities among the variances and the contribution of the covariances. Note that
if $H = sI$ and $H^{-1} = (1/s)I$, then $C_1(H^{-1}) = 0$, where s>0 is a scalar constant. Hence for the
orthogonal BAYFAC model (16) reduces to

$$\mathrm{ICOMP}_{of}(\mathrm{BayFac}) = np\log(2\pi) + n\bigl[\log|\hat\Sigma_m| + \mathrm{tr}\,\hat\Sigma_m^{-1}S\bigr] + 2(m+1)\,C_1(\hat\Psi_m). \tag{17}$$

The first component of ICOMP in both (16) and (17) measures the lack of fit,
the second component in (16) measures the complexity of the covariance matrix of
the estimated disturbance, which is not a diagonal matrix in the BAYFAC model, and
the third component measures the complexity of the estimated inter-factor
correlation matrix.
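
A minimal sketch of the two criteria follows. It assumes the usual form of the maximal entropic complexity, C1(Q) = (q/2) log[tr(Q)/q] - (1/2) log|Q| with q the dimension of Q, and the penalty 2[(m+1) C1(Ψ) + p C1(H^{-1})] read from the description above; the estimates and the sample covariance matrix used in the example are hypothetical numbers, not output of the BAYFAC model.

```python
import numpy as np


def c1(Q):
    """Maximal entropic complexity of a positive definite matrix Q."""
    q = Q.shape[0]
    sign, logdet = np.linalg.slogdet(Q)
    if sign <= 0:
        raise ValueError("Q must be positive definite")
    return 0.5 * q * np.log(np.trace(Q) / q) - 0.5 * logdet


def lack_of_fit(S, Sigma_hat, n):
    """np log(2 pi) + n [log|Sigma_hat| + tr(Sigma_hat^{-1} S)]."""
    p = S.shape[0]
    _, logdet = np.linalg.slogdet(Sigma_hat)
    return n * p * np.log(2 * np.pi) + n * (logdet + np.trace(np.linalg.solve(Sigma_hat, S)))


def icomp_obf(S, Sigma_hat, Psi_hat, H_hat, n):
    """ICOMP for the oblique model: lack of fit plus complexity penalty."""
    p, m = S.shape[0], H_hat.shape[0]
    penalty = 2 * ((m + 1) * c1(Psi_hat) + p * c1(np.linalg.inv(H_hat)))
    return lack_of_fit(S, Sigma_hat, n) + penalty


def icomp_of(S, Sigma_hat, Psi_hat, m, n):
    """ICOMP for the orthogonal model: the inter-factor term vanishes."""
    return lack_of_fit(S, Sigma_hat, n) + 2 * (m + 1) * c1(Psi_hat)


# Hypothetical estimates for p = 3 variables and m = 2 factors.
n, m = 100, 2
Lambda_hat = np.array([[0.8, 0.1], [0.7, 0.2], [0.1, 0.9]])
H_hat = np.array([[1.0, 0.3], [0.3, 1.0]])
Psi_hat = np.array([[0.40, 0.05, 0.00], [0.05, 0.50, 0.02], [0.00, 0.02, 0.30]])
Sigma_hat = Lambda_hat @ H_hat @ Lambda_hat.T + Psi_hat
S = Sigma_hat + 0.02 * np.eye(3)        # stand-in for a sample covariance matrix

print(icomp_obf(S, Sigma_hat, Psi_hat, H_hat, n))
print(c1(2.5 * np.eye(4)))              # complexity of a scalar multiple of the identity
```

The complexity of any scalar multiple of the identity is zero, which is why the inter-factor term drops out when the factors are orthogonal.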



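The second approach to ICOMP uses the complexity of the estimated
inverse-Fisher information matrix (IFIM) of the model with regard to the model's
parameter space. We denote this by ICOMP(IFIM). Similar to ICOMP, we have

$$\mathrm{ICOMP}_{obf}(\mathrm{IFIM}) = -2\log L(\hat\theta_m) + 2\,C_1(\hat{\mathcal F}_m^{-1}), \tag{18}$$

where

$$\hat{\mathcal F}_m^{-1} = \begin{bmatrix} \hat\Sigma_m & 0 \\ 0 & \frac{2}{n}\, D_p^{+}(\hat\Sigma_m \otimes \hat\Sigma_m) D_p^{+\prime} \end{bmatrix} \tag{19}$$

is the estimated IFIM, and where $\otimes$ denotes the Kronecker product. $D_p^{+}$ is the
Moore-Penrose inverse of the duplication matrix $D_p$. Operating the complexity
measure $C_1$ on the estimated IFIM, and after some work, $\mathrm{ICOMP}_{obf}(\mathrm{IFIM})$ is
given in an open form by

$$\begin{aligned}
\mathrm{ICOMP}_{obf}(\mathrm{IFIM}) ={}& np\log(2\pi) + n\bigl[\log|\hat\Sigma_m| + \mathrm{tr}\,\hat\Sigma_m^{-1}S\bigr] \\
&+ \frac{p(p+3)}{2}\log\left[\frac{\mathrm{tr}\,\hat\Sigma_m + \frac{1}{2n}\bigl\{\mathrm{tr}\,\hat\Sigma_m^{2} + (\mathrm{tr}\,\hat\Sigma_m)^{2} + 2\sum_{j=1}^{p}\hat\sigma_{jj}^{2}\bigr\}}{p(p+3)/2}\right] \\
&- p\log(2) + \frac{p(p+1)}{2}\log(n) - (p+2)\log|\hat\Sigma_m|.
\end{aligned} \tag{20}$$
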



The first component of $\mathrm{ICOMP}_{obf}(\mathrm{IFIM})$ in (20) measures the lack of fit of the
model, and the second component measures the complexity of the estimated
inverse-Fisher information matrix (IFIM), which gives a scalar measure of the
celebrated Cramér-Rao lower bound matrix which takes into account the
accuracy of the estimated parameters and implicitly adjusts for the number of
free parameters included in the model.

With both ICOMP criteria, complexity is viewed not as the number of
parameters in the model, but as the degree of interdependence (i.e., the
correlation structure among the parameter estimates). By defining complexity in
this way, ICOMP provides a more judicious penalty term than AIC, MDL, and
other AIC-type criteria.
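
The complexity of the estimated IFIM can be evaluated either directly, by building the block-diagonal matrix with the duplication matrix and a Kronecker product, or through a closed form in traces and determinants. The sketch below compares the two under the assumption that the IFIM has the blocks Σ and (2/n) D⁺(Σ⊗Σ)D⁺'; the covariance matrix and sample size are hypothetical, and the helper names are illustrative.

```python
import numpy as np


def duplication_matrix(p):
    """D_p such that D_p @ vech(A) = vec(A) for symmetric A (column-major vec)."""
    D = np.zeros((p * p, p * (p + 1) // 2))
    col = 0
    for j in range(p):
        for i in range(j, p):
            D[i + j * p, col] = 1.0     # position of element (i, j) in vec(A)
            D[j + i * p, col] = 1.0     # position of element (j, i) in vec(A)
            col += 1
    return D


def ifim(Sigma, n):
    """Block-diagonal IFIM with blocks Sigma and (2/n) D+ (Sigma kron Sigma) D+'."""
    p = Sigma.shape[0]
    D_plus = np.linalg.pinv(duplication_matrix(p))
    lower = (2.0 / n) * D_plus @ np.kron(Sigma, Sigma) @ D_plus.T
    k = lower.shape[0]
    return np.block([[Sigma, np.zeros((p, k))], [np.zeros((k, p)), lower]])


def c1(Q):
    """Maximal entropic complexity (q/2) log[tr(Q)/q] - (1/2) log|Q|."""
    q = Q.shape[0]
    _, logdet = np.linalg.slogdet(Q)
    return 0.5 * q * np.log(np.trace(Q) / q) - 0.5 * logdet


def two_c1_closed_form(Sigma, n):
    """Closed form of 2 * C1(IFIM) in terms of traces and log|Sigma|."""
    p = Sigma.shape[0]
    s = p * (p + 3) / 2
    tr = np.trace(Sigma) + (0.5 * np.trace(Sigma @ Sigma) + 0.5 * np.trace(Sigma) ** 2
                            + np.sum(np.diag(Sigma) ** 2)) / n
    _, logdet = np.linalg.slogdet(Sigma)
    return s * np.log(tr / s) - p * np.log(2) + 0.5 * p * (p + 1) * np.log(n) - (p + 2) * logdet


# Hypothetical covariance matrix and sample size.
Sigma = np.array([[1.0, 0.3, 0.2], [0.3, 1.0, 0.4], [0.2, 0.4, 1.0]])
n = 100
direct = 2 * c1(ifim(Sigma, n))
closed = two_c1_closed_form(Sigma, n)
print(direct, closed)
```

The closed form avoids forming the Kronecker product, whose size grows as p² by p², which matters when the criterion is evaluated for many candidate models.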

We derive Akaike's (1973) AIC given by

$$\mathrm{AIC}(\mathrm{BayFac}) = np\log(2\pi) + n\bigl[\log|\hat\Sigma_m| + \mathrm{tr}\,\hat\Sigma_m^{-1}S\bigr] + 2\bigl[\text{Number of nonzero loadings} + \tfrac{1}{2}m(m-1) + p\bigr].$$

Similarly, the expression of Rissanen's (1978) MDL, which is identical in form to
the Schwarz (1978) Bayesian criterion (SBC), is given by

$$\mathrm{MDL/SBC}(\mathrm{BayFac}) = np\log(2\pi) + n\bigl[\log|\hat\Sigma_m| + \mathrm{tr}\,\hat\Sigma_m^{-1}S\bigr] + \bigl[\text{Number of nonzero loadings} + \tfrac{1}{2}m(m-1) + p\bigr]\log(n).$$

Comparing ICOMP, AIC, and MDL/SBC, we note that the difference between these
criteria is in their crucial penalty term.
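
To make the difference in penalty terms concrete, the sketch below counts the parameters entering the AIC and MDL/SBC penalties for a given loading pattern. The pattern used is the simple structure of the simulated example in Section 4 (nine variables, three factors); the function names are illustrative and n=100 is taken from that example.

```python
import numpy as np


def n_penalized_parameters(loading_pattern):
    """Number of nonzero loadings + m(m-1)/2 inter-factor correlations + p."""
    p, m = loading_pattern.shape
    return int(np.count_nonzero(loading_pattern)) + m * (m - 1) // 2 + p


def aic_penalty(k):
    """AIC penalty: twice the number of parameters."""
    return 2 * k


def mdl_penalty(k, n):
    """MDL/SBC penalty: number of parameters times log(n)."""
    return k * np.log(n)


# Loading pattern of the population model in Section 4 (rows = variables).
pattern = np.array([
    [0.6, 0.2, 0.0], [0.7, 0.0, 0.2], [0.8, 0.0, 0.0],
    [0.2, 0.6, 0.0], [0.0, 0.7, 0.2], [0.0, 0.8, 0.0],
    [0.2, 0.0, 0.6], [0.0, 0.2, 0.7], [0.0, 0.0, 0.8],
])
k = n_penalized_parameters(pattern)
print(k, aic_penalty(k), mdl_penalty(k, n=100))
```

Here k = 15 + 3 + 9 = 27, so the AIC penalty is 54, while the MDL/SBC penalty is 27 log(100), which is larger whenever log(n) exceeds 2.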



4. A Simulated Numerical Example 

In this section, we give a simulated numerical example based on an oblique factor 
model structure provided by Mulaik (1989) with the population parameters for the 
actual number of factors m*=3, and p=9 variables given by 





$$\Lambda' = \begin{bmatrix}
0.6 & 0.7 & 0.8 & 0.2 & 0 & 0 & 0.2 & 0 & 0 \\
0.2 & 0 & 0 & 0.6 & 0.7 & 0.8 & 0 & 0.2 & 0 \\
0 & 0.2 & 0 & 0 & 0.2 & 0 & 0.6 & 0.7 & 0.8
\end{bmatrix}, \quad
H = \begin{bmatrix}
1.00 & 0.08 & 0.12 \\
0.08 & 1.00 & 0.24 \\
0.12 & 0.24 & 1.00
\end{bmatrix},$$

where $\Lambda'$ is the transpose of the (9×3) factor loading matrix $\Lambda$, and H is the (3×3)
inter-factor correlation matrix. The unique variances are given by the diagonal
matrix

$\Psi$ = Diag(0.581, 0.436, 0.360, 0.581, 0.403, 0.360, 0.571, 0.403, 0.360).

In our study, we first generated a random sample of n=100 observations from a
multivariate normal distribution with mean vector $\mu = 0$ and covariance
matrix $\Sigma = AHA' + \Psi$, with A, H, and $\Psi$ given above. Next,
pretending that we do not know the number of factors, we fit m=1, 2, ..., 6
factors. We use maximum likelihood factor analysis (MLFA) and obtain the
varimax rotated (or unrotated) ML factor loadings and the model correlation
matrices for each of m=1, 2, ..., 6 factors for this simulated data. Then we
apply the sparse root algorithm of Hartigan (1975, p. 314) to obtain,
data-adaptively for each of the models, the factor pattern structure of $A_0$,
the prior factor loading matrix. This algorithm is an iterative clustering
procedure which works on a factor model correlation (or covariance) matrix
together with a sample correlation matrix of the data. The technique produces
the loading matrix and its simple pattern structure as the root of the model
correlation matrix. It seeks roots of the model correlation matrix with many
zeros. The zeros correspond to entries (or variables) that are not members of
a given cluster, and a given column of the root matrix represents a cluster.
The root matrix is determined iteratively by sweeping the model correlation
matrix one column at a time. The approach is essentially a principal component
type procedure which takes into account the linear combination of the original
variables. The columns of the root matrix are chosen to be eigenvectors of the
model correlation matrix for which the ratio (eigenvalue)/(number of nonzero
elements of the eigenvector) is a maximum. The model correlation matrix is
modified in terms of a partial correlation matrix at each stage of the fitting
process until the root, that is, the simple pattern structure of the loading
matrix, is reached. In this manner, we discover the simple pattern structure of
a factor loading matrix data-adaptively, which permits an interpretable factor
pattern of the final factors as clusters of variables. We then use the pattern
structures corresponding to m=1, 2, ..., 6 factors to initialize our prior
factor loading matrix $A_0$, prior inter-factor correlation matrix $H_0$, and
prior disturbance matrix $\Psi_0$, respectively, in the BAYFAC algorithm. With
this set up, we ran 100 simulations, generating new data sets for each of
m=1, 2, ..., 6 factors, and scored the number of times the information criteria
selected the true number of factors and the corresponding factor pattern
structure for different sample sizes. The results are summarized in Table 1
below. These results show that the ICOMP and ICOMP(IFIM) criteria outperform
AIC and MDL in selecting the number of factors and the associated factor
pattern structure in small samples, which is what we like to see in Bayesian
modeling. All the computations are carried out using new open architecture,
easy-to-use interactive computational modules called BAYFAC, which are
developed by the first author in MATLAB®.
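
The data-generating step of the study can be sketched as follows. This is an illustrative Python sketch and not the BAYFAC code, which is written in MATLAB and is not reproduced in the text. It draws each observation as x = A f + e with f ~ N(0, H) and e ~ N(0, Ψ), which has covariance AHA' + Ψ. The helper names `cholesky` and `draw_sample`, and the random seed, are assumptions made for the example.

```python
import math
import random

A = [
    [0.6, 0.2, 0.0], [0.7, 0.0, 0.2], [0.8, 0.0, 0.0],
    [0.2, 0.6, 0.0], [0.0, 0.7, 0.2], [0.0, 0.8, 0.0],
    [0.2, 0.0, 0.6], [0.0, 0.2, 0.7], [0.0, 0.0, 0.8],
]
H = [
    [1.00, 0.08, 0.12],
    [0.08, 1.00, 0.24],
    [0.12, 0.24, 1.00],
]
PSI = [0.581, 0.436, 0.360, 0.581, 0.403, 0.360, 0.571, 0.403, 0.360]

def cholesky(M):
    """Lower-triangular L with L L' = M, for a small positive definite M."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(M[i][i] - s)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return L

def draw_sample(n, rng):
    """Draw n observations x = A f + e, f ~ N(0, H), e ~ N(0, diag(PSI))."""
    L = cholesky(H)
    m = len(H)
    sample = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Correlated factor scores f = L z, so that cov(f) = H.
        f = [sum(L[a][b] * z[b] for b in range(a + 1)) for a in range(m)]
        x = [sum(A[i][a] * f[a] for a in range(m))
             + math.sqrt(PSI[i]) * rng.gauss(0.0, 1.0)
             for i in range(len(A))]
        sample.append(x)
    return sample

sample = draw_sample(100, random.Random(0))  # one data set of size n=100
```

Fitting the m=1, ..., 6 factor models and scoring the information criteria would follow on each such data set; those steps are not sketched here.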
Table 1: The number of times information criteria selected the true m=3 factors.

Sample sizes                n=30                          n=50
Criteria\m       1    2    3    4    5    6      1    2    3    4    5    6
AIC              8   20   70    0    0    0     75   25    0    ?    ?    0
MDL/SBC          ?    ?    ?    ?    0    ?    100    ?    0    ?    0    ?
ICOMP            ?    ?    ?    ?    0    0     62    ?    0    0    0    ?
ICOMP(IFIM)      0    0   98    ?    0    ?      0    8   92    0    0    0

(? = entry not legible in the source.)

Nonlinear Biplots for Multivariate Normal 
Grouped Populations 
Abstract: In this paper a new method is developed for the simultaneous
representation, in a low dimensional space, of populations and variables
associated with the multivariate normal model (without covariance matrix
restrictions). The method is based on non-linear biplot methodology over the
Siegel metric, which allows us to define a distance between populations.
Finally, the variables are represented by the directions of maximum variation
of their means.

Keywords: Nonlinear biplots, Siegel distance, Multivariate normal distribution.



1. Introduction 

Biplots are used to represent samples and variables jointly in a low
dimensional plot. Gower and Harding (1988) generalised the classical biplot
method to metrics embeddable in a Euclidean space, and proposed to extend this
idea to any kind of metric. On the other hand, there is no doubt that one of
the most used (continuous) models in multivariate data analysis is the
multivariate normal distribution. But when we want to plot, in a low
dimensional space, samples from populations associated with this model, we find
that the existing methods are restricted to the assumption of a common
covariance matrix (as in the classical canonical discriminant analysis
introduced by Rao, 1948, since the Mahalanobis distance is used) or, at least,
to common eigenvectors of their covariance matrices (Krzanowski, 1996). Since
these assumptions often fail to hold in practice, it seems reasonable to look
for a low dimensional representation that does not rely on them and that
includes a representation of the variables.

To reach this goal, we propose to follow the non-linear biplot approach, but we
still need a metric for the multivariate normal model without restrictions.
Although the Hellinger or Bhattacharyya distances could have been chosen, in
this paper we propose to use the Siegel distance for the following reasons (see
Calvo and Oller, 1990, for more details): the Siegel distance is not upper
bounded, as the other two are; it is invariant to affine transformations of the
random variables; and it is closely related to the Rao distance.
2. Representation of Populations

Let us consider the populations $\Omega_1, \ldots, \Omega_p$, where the
probabilistic model of the $x_1, x_2, \ldots$ vector samples is the
multivariate normal model. For each population $\Omega_i$, its associated
density function has parametric representation $(\mu_i, \Sigma_i)$. The
multivariate normal embedding into the Siegel group (see Calvo and Oller,
1990) is obtained by associating each density with the following symmetric
positive-definite matrix:

$$
S_i = \begin{pmatrix} \Sigma_i + \mu_i \mu_i^T & \mu_i \\ \mu_i^T & 1 \end{pmatrix}
\qquad (1)
$$

In practice, $(\mu_i, \Sigma_i)$ are unknown, so they are substituted by their
maximum likelihood estimators $(\bar{x}_i, \hat{\Sigma}_i)$ to represent
$\Omega_i$, giving:

$$
\hat{S}_i = \begin{pmatrix} \hat{\Sigma}_i + \bar{x}_i \bar{x}_i^T & \bar{x}_i \\ \bar{x}_i^T & 1 \end{pmatrix}
\qquad (2)
$$

The Siegel distance between the populations $\Omega_i$ and $\Omega_j$ is
defined as the Riemannian distance between the two matrices in the Siegel
group where the populations are embedded:

$$
d(\Omega_i, \Omega_j) = \frac{1}{\sqrt{2}} \left\| \ln\left( S_i^{-1/2} S_j S_i^{-1/2} \right) \right\|
= \left( \frac{1}{2} \sum_k \ln^2 \lambda_k \right)^{1/2}
\qquad (3)
$$

where $\|A\| = \sqrt{\mathrm{tr}(AA^T)}$, $\ln(S)$ stands for the ordinary
matrix logarithm and $\lambda_k$ are the eigenvalues of
$S_i^{-1/2} S_j S_i^{-1/2}$.

From (3) we compute the interdistance matrix D between the p populations:

$$
D = (d_{ij}) = \left( d(\Omega_i, \Omega_j) \right) \qquad (4)
$$
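
To make the embedding and the distance concrete, here is a small Python sketch restricted to the univariate case, where each population is a pair (mean, variance) and the embedded matrix is 2x2, so its eigenvalues follow from the quadratic formula. It assumes the normalisation d = ((1/2) Σ ln² λ_k)^(1/2) of Calvo and Oller, and uses the fact that S_i^{-1} S_j has the same eigenvalues as S_i^{-1/2} S_j S_i^{-1/2}. The names `embed` and `siegel_distance` and the example populations are hypothetical; the general multivariate case would need a numerical eigenvalue routine.

```python
import math

def embed(mean, var):
    """2x2 Siegel embedding of a univariate normal N(mean, var)."""
    return [[var + mean * mean, mean], [mean, 1.0]]

def siegel_distance(pop_i, pop_j):
    """Siegel distance between two (mean, variance) pairs (univariate only)."""
    Si, Sj = embed(*pop_i), embed(*pop_j)
    det_i = Si[0][0] * Si[1][1] - Si[0][1] * Si[1][0]  # equals the variance
    inv = [[Si[1][1] / det_i, -Si[0][1] / det_i],
           [-Si[1][0] / det_i, Si[0][0] / det_i]]
    # M = Si^{-1} Sj shares its eigenvalues with Si^{-1/2} Sj Si^{-1/2}.
    M = [[sum(inv[r][k] * Sj[k][c] for k in range(2)) for c in range(2)]
         for r in range(2)]
    trace = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    lam1, lam2 = trace / 2.0 + disc, trace / 2.0 - disc
    return math.sqrt(0.5 * (math.log(lam1) ** 2 + math.log(lam2) ** 2))

# Interdistance matrix for three illustrative populations.
pops = [(0.0, 1.0), (0.0, 4.0), (1.0, 2.0)]
D = [[siegel_distance(a, b) for b in pops] for a in pops]
```

For two populations with equal means and variances 1 and 4 the sketch gives ln(4)/√2, which is the known Rao distance between zero-mean normals with those variances.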



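Now, we can use classical Principal Coordinate Analysis to reduce the
dimension. The computing process starts by obtaining the matrix T:

$$
T = (t_{ij}) \quad \text{with} \quad
t_{ij} = -\frac{1}{2} \left( d_{ij}^2 - d_{i\cdot}^2 - d_{\cdot j}^2 + d_{\cdot\cdot}^2 \right),
$$
$$
d_{i\cdot}^2 = \frac{1}{p} \sum_{j=1}^{p} d_{ij}^2, \quad
d_{\cdot j}^2 = \frac{1}{p} \sum_{i=1}^{p} d_{ij}^2, \quad
d_{\cdot\cdot}^2 = \frac{1}{p^2} \sum_{i,j=1}^{p} d_{ij}^2
\qquad (5)
$$

and then diagonalizing it:

$$
T = P \Lambda P^T = W W^T \quad \text{with} \quad P P^T = I \quad \text{and} \quad
W = P \Lambda^{1/2}, \quad \Lambda \text{ diagonal}. \qquad (6)
$$

The principal coordinates are the columns of W. Usually, only the first and
second principal coordinates are used in the final plot.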
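
The first step of Principal Coordinate Analysis, the passage from the distance matrix D to the matrix T, can be sketched in Python as below. The sketch assumes the usual double-centering form of (5); the function name `double_center` and the three-point example (points at 0, 1 and 3 on a line) are illustrative choices. The eigendecomposition of T in (6) is left to a numerical library.

```python
def double_center(D):
    """Return T with t_ij = -1/2 (d_ij^2 - d_i.^2 - d_.j^2 + d_..^2)."""
    p = len(D)
    sq = [[D[i][j] ** 2 for j in range(p)] for i in range(p)]
    row = [sum(sq[i]) / p for i in range(p)]                        # d_i.^2
    col = [sum(sq[i][j] for i in range(p)) / p for j in range(p)]   # d_.j^2
    total = sum(row) / p                                            # d_..^2
    return [[-0.5 * (sq[i][j] - row[i] - col[j] + total) for j in range(p)]
            for i in range(p)]

# Distances between three collinear points with coordinates 0, 1 and 3.
D = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 2.0],
     [3.0, 2.0, 0.0]]
T = double_center(D)
```

Each row of T sums to zero, and the squared distances are recovered as t_ii + t_jj - 2 t_ij, which is what makes the columns of W usable as coordinates.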



3. Representation of the Variables 

To represent the k-th variable, the non-linear biplot defines a pseudo-sample
$\alpha e_k$ (where $e_k$ is a unit vector of the original k-th axis). Then,
the results of Gower (1968) for representing a new point in Euclidean space,
with known distances from p points, are used. However, in our case, this
methodology cannot be applied directly because there is no sense in defining a
pseudo-sample: our points are multivariate normal densities, each one
represented by a matrix $S_i$ depending on $(\mu_i, \Sigma_i)$, that is, on
$n + \frac{n(n+1)}{2}$ parameters (where n is the number of observed
variables). To solve this problem, we propose to represent each variable $X_i$
($i = 1, \ldots, n$) in the original space by the direction of the maximum
variation of its mean $E(X_i)$. Notice that the above-mentioned pseudo-samples
in a classical biplot also correspond to the directions of maximum variation of
the observed variables.

In more technical terms, the gradient of $E(X_i)$ must be computed and, by
integrating it, the directions of maximum variation of $E(X_i)$ are obtained as
a bundle of curves. In the classical linear biplot procedure, the trajectories
representing the variables concur at the centroid of the sample points. In our
case, there is no such obvious point, but we propose as a candidate $S_0$, the
barycenter (centre of mass) of the populations, although other points could
also be considered. The curve of maximum variation of the random variable
$X_i$ which passes through $S_0$ is:






^loe.e', Sot" + SoC,//',. t + 

e- lot 0 



( 7 ) 



where $e_i$ denotes the i-th vector of the canonical basis, and t the curve
parameter.

At this point, each variable is represented as a trajectory (collection of
points) in the Siegel space, and every point $V_{ij}$ (the j-th point of the
curve corresponding to the i-th variable) will be mapped into the principal
coordinate space following Gower and Harding (1988) by

$$
V_{ij} \mapsto w_{ij} = \frac{1}{2} \Lambda^{-1/2} P^T d, \qquad (8)
$$

where $\Lambda$ and P are the same as in expression (6), and
$d = (d_1, \ldots, d_p)^T$ with

$$
d_r = \frac{1}{p} \sum_{l=1}^{p} d_{rl}^2
- \frac{1}{2p^2} \sum_{l=1}^{p} \sum_{m=1}^{p} d_{lm}^2
- d^2(V_{ij}, \Omega_r), \qquad (9)
$$

where $d(V_{ij}, \Omega_r)$ is computed as in (3).



4. Example

Thomson, A. and Randall-Maciver, R. (1905) reported four measurements of
150 male Egyptian skulls from five different time periods. The variables
measured were: MB (maximum breadth of skull), BH (basibregmatic height of
skull), BL (basialveolar length of skull) and NH (nasal height of skull). The
populations to study are the five time periods: Early Predynastic (about 4000
B.C.), Late Predynastic (about 3300 B.C.), 12th-13th dynasty (about 1850
B.C.), Ptolemaic (about 200 B.C.) and Roman (about A.D. 150). A first
analysis shows that the homogeneity of the variance-covariance matrix does not
hold (Bartlett's test: P=8.94x 10“’^), so that classical techniques of graphical
display that assume homogeneity of the covariance matrix should not be used.
Applying the method described here, we compute the interdistance matrix
according to (3):



$$
D = \begin{pmatrix}
0 & & & & \\
0.94308 & 0 & & & \\
1.36731 & 1.22836 & 0 & & \\
1.71955 & 1.59887 & 1.15124 & 0 & \\
1.78993 & 1.59732 & 1.15836 & 1.18607 & 0
\end{pmatrix}
$$

Then we compute the classical Principal Coordinate Analysis to reduce the
dimensionality. All the non-zero eigenvalues are positive and the first two
represent 76.62 % of their sum.

Finally, we can represent the variables according to Section 3. In Figure 1 we
have the joint plot of populations (points) and variables (trajectories).



Figure 1: Non-linear biplot of five Egyptian periods. The barycenter of the
populations is the common point of the trajectories. Equal intervals in the
original scales of the variables are indicated.




From the position of the populations in Figure 1 it is clear that time
progresses from left to right. Therefore MB and NH increase over time, while
the other two (BH and BL) decrease, which agrees with the variable vs. time
plots (not shown). It is also interesting to notice that the irregular spacing
of scale points along the trajectories suggests further distortions of the
space (see the positive end of variable BH for instance).



5. Discussion 



In this paper we have introduced a new method based on non-linear biplot
techniques for simultaneous representation of populations and variables
associated with multivariate normal models. Although simultaneous displays for
special cases had been introduced (Rios et al., 1991, for a common covariance
matrix), the main advantage of our method is that there are no restrictions on
the covariance matrix structure. Moreover, it is invariant to affine
transformations of the random variables and is related to the Rao distance
(the Siegel distance used here is a lower bound of the Rao distance). One
limitation of our method is the possibility of obtaining negative eigenvalues
in (6), since the Siegel distance is not Euclidean. However, the negative
eigenvalues of the $\Lambda$ matrix will usually be small in absolute value
relative to the positive ones and can therefore be ignored. If this is not the
case, a more general technique such as multidimensional scaling could be used
instead (see e.g. Gower and Hand, 1996).

The variables are represented by the directions of maximum variation of their
means, and a concurrence point for their trajectories must be chosen. We
propose the barycenter of the populations, although other points could be
taken. For instance, we can represent different trajectories of a variable in
the same plot. Another advantage of this method is that we are not restricted
to displaying only the first moment variations: moments of any other order
(including mixed moments), as well as other parametric functions of the random
variables, can be displayed too.
Abstract: The paper concerns clustering values of discrete-continuous
variables X and Y when the numbers of clusters are given a priori. The
clustering aims at minimal reduction of absolute dependence between X and Y,
measured by the maximal value of Spearman's rho. The grade correspondence
analysis (GCA) based on Spearman's rho is performed first and then followed
by discretization of the pair of the GCA variables, which is directly linked to
discretization of X and Y. Discretization of only one variable is considered in
Sec. 3. It is then used in Sec. 4 to construct an algorithm of simultaneous
discretization of both GCA variables. Two examples are presented.

Keywords: Correspondence Analysis, Positive Dependence, Correlation
Curve, Maximal Grade Correlation, Spearman's Rho, Mixed Variables,
Clustering Algorithm



1. Introduction 

Simultaneous aggregation of rows and columns in two-way contingency tables
has been considered in several papers since Hartigan (1975). In a slightly more
general framework, pairs of categorical variables are replaced by pairs (X,Y)
in which any variable may be discrete-continuous. An optimization criterion
should be based on a measure of absolute dependence since dependence is
inseparable from concentration and hence from clustering. We have chosen to
maximize a generalized version of Spearman's rho (denoted $\rho^*$), applicable
to any pair of mixed (discrete-continuous) variables. It is shown in Kowalczyk
et al. (1996) that $\rho^*$ can be represented as an index of concentration.

The present paper is a continuation and extension of Kowalczyk et al.
(1996) and Ciok et al. (1994, 1995). In Ciok et al. (1995) the grade
correspondence analysis (GCA) was introduced as an analogue of the classic
correspondence analysis. GCA consists in maximizing $\rho^*$ over the set of
all possible pairs of permutations of the supports of X and of Y. The optimal
pair is called the pair of GCA variables. As argued in Kowalczyk et al. (1996)
and Ciok et al. (1995), the value of $\rho^*$ for any pair of aggregations of
the GCA variables can be dominated by the value of $\rho^*$ for a pair of
suitably chosen discretizations of these variables. This is why we can restrict
ourselves to discretizations following GCA. But which scheme of discretization
is then appropriate?

The algorithm proposed in Ciok et al. (1995) provides a sequence of
discretizations such that the numbers (n, s) of clusters for X and Y derived at
step i are changed at step i + 1 into (n - 1, s) or (n, s - 1): the algorithm
finds at each step a pair of adjacent rows or adjacent columns to be pooled
which minimizes the loss of the value of $\rho^*$. But the authors observed
that this may lead to serious drawbacks even for distributions with very
regular positive dependence, because the discretizations in the sequence may be
far from optimal ones with the same numbers of clusters.

The method proposed in the present paper tries to approximate optimal 
discretization when the cluster numbers (n,s) of the solution are given a priori. 
A suitable choice of (n,s) is beyond the scope of this paper. 

Section 2 contains basic information on GCA . Discretization of just one 
variable is described in Sec. 3. Section 4 provides an iterative algorithm for two 
variables; it is based on alternating use of the solutions only for X and only for 
Y when in each case the other variable in the pair is previously discretized 
according to the scheme from the preceding iteration. 

Applications of the proposed method are not restricted to clustering in 
contingency tables or, more generally, bivariate distributions. There are many 
situations in data analysis when it is desirable to treat a particular table with 
non-negative values as a contingency table. We can deal with sets of 
compositional data or objects-variables data. Analyses of real data based on 
GCA are given e.g. in Ciok et al. (1997), Ciok (1995). In both cases a proper 
clustering was crucial for investigation. 



2. Grade parameters describing two-way probability tables 

Let X and Y be categorical variables distributed according to
$P = (p_{ij})$, $i = 1, \ldots, m$; $j = 1, \ldots, k$. Let $(X^*, Y^*)$ be a
pair of continuous variables distributed on $[0,1]^2$ with the constant density
$p_{ij} / (p_{i\cdot}\, p_{\cdot j})$ on the rectangle
$\{(u, v): S_{i-1} < u \le S_i,\ T_{j-1} < v \le T_j\}$, where
$S_i = \sum_{l=1}^{i} p_{l\cdot}$ and $T_j = \sum_{l=1}^{j} p_{\cdot l}$,
$i = 1, \ldots, m$; $j = 1, \ldots, k$; $S_0 = T_0 = 0$. The density of
$(X^*, Y^*)$ is called the randomized grade density of (X,Y). The regression
functions $E(Y^* | X^* = t)$ and $E(X^* | Y^* = t)$ are called the randomized
grade regression functions of Y on X and of X on Y, denoted $r^*(Y:X)$ and
$r^*(X:Y)$, respectively. Further, the correlation coefficient
$cor(X^*, Y^*)$ is called the randomized grade correlation coefficient of
(X,Y) and denoted $\rho^*(X,Y)$. It is equal to Spearman's rho for pairs of
continuous variables, to Schriever's extension of Spearman's rho for pairs of
categorical variables (Schriever (1985)), and may be presented as:

$$
\rho^*(X,Y) = 6 \int_0^1 \left( u - C^*(Y:X)(u) \right) du
= 6 \int_0^1 \left( u - C^*(X:Y)(u) \right) du
$$

where $C^*(Y:X)(t) = 2 \int_0^t r^*(Y:X)(u)\, du$ is the randomized grade
correlation curve, while $C^*(X:Y)$ is defined analogously.
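
As a numerical illustration of these definitions, the following Python sketch computes the randomized grade correlation of a two-way probability table. It relies on the identity ρ* = 12 E(X*Y*) - 3, which holds because X* and Y* have uniform marginals on [0,1], and on the fact that the grade density is constant on each rectangle, so E(X*Y*) is a weighted sum of products of interval midpoints. The function name `grade_rho` and the example tables are illustrative assumptions.

```python
def grade_rho(P):
    """Randomized grade correlation rho* of a two-way probability table P."""
    n_rows, n_cols = len(P), len(P[0])
    row_marg = [sum(P[i]) for i in range(n_rows)]
    col_marg = [sum(P[i][j] for i in range(n_rows)) for j in range(n_cols)]

    def midpoints(marginal):
        # Midpoint of each interval (S_{i-1}, S_i] of cumulated marginals.
        mids, cum = [], 0.0
        for p in marginal:
            mids.append(cum + p / 2.0)
            cum += p
        return mids

    a, b = midpoints(row_marg), midpoints(col_marg)
    e_xy = sum(P[i][j] * a[i] * b[j]
               for i in range(n_rows) for j in range(n_cols))
    return 12.0 * e_xy - 3.0

independent = [[0.06, 0.14], [0.24, 0.56]]   # product of (0.2, 0.8) and (0.3, 0.7)
diagonal = [[0.5, 0.0], [0.0, 0.5]]          # perfect positive dependence
print(grade_rho(independent), grade_rho(diagonal))
```

For an independent table the value is zero, and for the 2x2 diagonal table it equals 0.75, which is the largest value attainable with two equiprobable categories.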

Parameters $C^*$ are used to find the transformations of X and Y which
maximize the value of $\rho^*$ in the set of pairs (f(X), g(Y)), where f and g
are 1-1 functions $f: S_X \to S_X$, $g: S_Y \to S_Y$, and $S_X$, $S_Y$ are the
supports of X and Y, respectively. It is known (Kowalczyk et al. (1996)) that
the maximal value of $\rho^*$ is attained only if both regression functions
$r^*(Y:X)$ and $r^*(X:Y)$ are non-decreasing or, equivalently, if curves
$C^*(Y:X)$ and $C^*(X:Y)$ are convex. This property has been exploited for
constructing a GCA algorithm described in Ciok et al. (1995), which
approximates the maximal value of $\rho^*$, a pair of optimal transformations
of X and Y, and the respective pairs of optimal grade regression functions as
well as optimal grade correlation curves.
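
For very small tables the maximization over permutations can be carried out by exhaustive search, which makes the definition of the GCA variables tangible. The Python sketch below is such a brute-force search; it is not the GCA algorithm of Ciok et al. (1995), whose details are not given here, and its cost grows factorially with the table size. The names `grade_rho` and `brute_force_gca` are illustrative.

```python
from itertools import permutations

def grade_rho(P):
    """Randomized grade correlation: 12 E(X*Y*) - 3 with midpoint scores."""
    rows = [sum(r) for r in P]
    cols = [sum(r[j] for r in P) for j in range(len(P[0]))]

    def midpoints(marginal):
        mids, cum = [], 0.0
        for p in marginal:
            mids.append(cum + p / 2.0)
            cum += p
        return mids

    a, b = midpoints(rows), midpoints(cols)
    return 12.0 * sum(P[i][j] * a[i] * b[j]
                      for i in range(len(P)) for j in range(len(P[0]))) - 3.0

def brute_force_gca(P):
    """Try every row and column ordering; return (best rho*, rows, cols)."""
    best = None
    for row_order in permutations(range(len(P))):
        for col_order in permutations(range(len(P[0]))):
            Q = [[P[i][j] for j in col_order] for i in row_order]
            rho = grade_rho(Q)
            if best is None or rho > best[0]:
                best = (rho, row_order, col_order)
    return best

table = [[0.0, 0.5], [0.5, 0.0]]      # negatively ordered categories
best_rho, best_rows, best_cols = brute_force_gca(table)
```

In the example the original ordering has a negative grade correlation, and reordering one of the two variables turns it into the maximal positive value.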

We will say that X and Y are GCA - mutually positively dependent if the
value of $\rho^*(X,Y)$ cannot be increased by any pair of 1-1 transformations
of X and Y. A well-known vague interpretation of mutual positive dependence
between X and Y is that in each set of conditional distributions,
$\{X | Y = y\}$ and $\{Y | X = x\}$, conditional distributions in the
neighbourhood tend to be similar. This suggests an algorithm of mutual
clustering of values of X and Y: firstly, X and Y are transformed to be GCA -
mutually positively dependent; secondly, since pooling adjacent conditional
distributions is then justified, we try to discretize both variables.



3. Discretization of only one variable 

Throughout this section and the next one we assume that X and Y are GCA -
mutually positively dependent. Discretization of only one variable in the pair
is based on the following fact (the proof is omitted).

Let F be a convex function such that $F: [0,1] \to [0,1]$ and
$F(0) = 0$, $F(1) = 1$, and let $f = F'$ denote its derivative. Numbers
$w_0 = 0 < w_1 < \ldots < w_n = 1$ define a partition of interval [0,1] into n
segments. Let $\tilde{F}$ denote a function which is linear inside every
segment and such that $\tilde{F}(w_i) = F(w_i)$, $i = 0, \ldots, n$. If the
partition defined by vector $w = (w_1, \ldots, w_{n-1})$ minimizes the area
between curves $F$ and $\tilde{F}$ then w fulfils the condition: for
$i = 1, \ldots, n-1$

$$
f(w_i) = \frac{F(w_{i+1}) - F(w_{i-1})}{w_{i+1} - w_{i-1}}. \qquad (1)
$$

It follows that for any $i = 1, \ldots, n-1$ the minimal value of
$\int_{w_{i-1}}^{w_{i+1}} (f(x) - A)^2\, dx$ for $A \in (0, \infty)$ is
attained at

$$
A = f(w_i). \qquad (2)
$$

Moreover, if f is increasing then $w_i$ satisfying (2) is unique and thus
provides an optimal solution.
Generally, vectors w satisfying (1) form a set of admissible solutions in the
problem of minimizing the area between $F$ and $\tilde{F}$. To obtain an
admissible vector, we choose an initial vector
$w^0 = (0, w_1^0, \ldots, w_{n-1}^0, 1)$ and modify it, finding successively:
$w_1^1$, which fulfils condition (1) for i = 1 when $(w_0, w_2)$ is replaced by
$(0, w_2^0)$; $w_2^1$, which fulfils (1) for i = 2 when $(w_1, w_3)$ is
replaced by $(w_1^1, w_3^0)$; and so on until $w_{n-1}^1$ is obtained. In the
next step, $w^1 = (0, w_1^1, \ldots, w_{n-1}^1, 1)$ is treated as the initial
vector and $w^2$ is the final one. This procedure is repeated until all
thresholds satisfy condition (1). The above algorithm has been applied to many
functions F and in each case the solution was easily obtained and found to be
the same for any initial $w^0$.
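
The successive updating of thresholds can be sketched in Python as follows. The sketch assumes that the stationarity condition for threshold w_i is that the derivative f = F' at w_i equals the slope of the chord of F between the two neighbouring thresholds, which is what minimizing the area between a convex F and its piecewise-linear interpolant requires when F is differentiable. It also assumes f is increasing, so that each update can be found by bisection. The names `solve_increasing` and `admissible_thresholds` are hypothetical.

```python
def solve_increasing(f, target, lo, hi, tol=1e-12):
    """Find w in [lo, hi] with f(w) = target, for an increasing f (bisection)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def admissible_thresholds(F, f, w_init, max_sweeps=500, tol=1e-10):
    """Update interior thresholds one at a time until none of them moves.

    w_init includes the end points 0 and 1, which stay fixed.
    """
    w = list(w_init)
    for _ in range(max_sweeps):
        largest_move = 0.0
        for i in range(1, len(w) - 1):
            # Slope of the chord of F between the current neighbours of w[i].
            slope = (F(w[i + 1]) - F(w[i - 1])) / (w[i + 1] - w[i - 1])
            new_w = solve_increasing(f, slope, w[i - 1], w[i + 1])
            largest_move = max(largest_move, abs(new_w - w[i]))
            w[i] = new_w
        if largest_move < tol:
            break
    return w

# Example with F(u) = u^2, for which the thresholds become equally spaced.
w = admissible_thresholds(lambda u: u * u, lambda u: 2.0 * u,
                          [0.0, 0.1, 0.2, 0.9, 1.0])
```

For F(u) = u^2 the condition reduces to w_i being the midpoint of its neighbours, so the procedure ends at equally spaced thresholds regardless of the starting vector.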

Now, we put $C^*(Y:X)$ as $F$ and $C^*(Y:\tilde{X})$ as $\tilde{F}$, where
$\tilde{X}$ refers to X aggregated according to the discretization of $X^*$
defined by vector w (discretization thresholds are quantiles of order $w_i$).
The area between the two curves is equal to one-sixth of
$\rho^*(X,Y) - \rho^*(\tilde{X},Y)$, and in view of condition (1) $\tilde{X}$
is an admissible discretization of X. It is approximately optimal when it is
the best in a set of admissible ones obtained for various initial $w^0$.

If $r^*(Y:X)(u) = 1 - r^*(Y:X)(1 - u)$ for $u < 0.5$ then for any $n = 2k$,
$w_k = 0.5$ and $w_i = 1 - w_{2k-i}$ for $i < k$. For binormal (X,Y) with the
correlation coefficient $\rho$ the regression function is
$r^*(Y:X)(u) = \Phi\left( \rho\, \Phi^{-1}(u) / \sqrt{2 - \rho^2} \right)$
(cf. Kowalczyk (1996)). In this case condition (2) has a unique solution. For
instance, for $\rho = 0.6$ and $n = 16$, the respective thresholds $w_i$,
$i = 1, \ldots, 15$ are:

0.039, 0.093, 0.153, 0.218, 0.286, 0.357, 0.428, 0.5, 0.572, 0.643, 0.714,
0.782, 0.847, 0.907, 0.961. They define the optimal discretization of X
according to $r^*(Y:X)$.
4. Simultaneous discretization of two variables

Let n and s be the numbers of categories settled a priori to discretize X and
Y. It follows from Sec. 3 that we can find approximately optimal thresholds
when we discretize only X or only Y. The resulting vectors can be taken as an
initial pair of threshold vectors $(\alpha_0, \beta_0)$ for simultaneous
discretization. The proposed algorithm produces iteratively vectors
$(\alpha_i^1, \beta_i^1)$ and $(\alpha_i^2, \beta_i^2)$ together with the
corresponding values
$\rho_i^{*l} = \rho^*(X(\alpha_i^l), Y(\beta_i^l))$, $l = 1, 2$, where
$X(\alpha_i^l)$ and $Y(\beta_i^l)$ denote respectively variable X discretized
according to $\alpha_i^l$ and variable Y discretized according to
$\beta_i^l$. For $i = 2j + 1$, $\alpha_i^1 = \alpha_{i-1}^1$; for $i = 2j$
vector $\alpha_i^1$ consists of n thresholds for X in the pair
$(X, Y(\beta_{i-1}^1))$, derived from $r^*(Y(\beta_{i-1}^1):X)$ as described
in Sec. 3, provided that $r^*(Y(\beta_{i-1}^1):X)$ is increasing; otherwise
$(X, Y(\beta_{i-1}^1))$ is replaced by its GCA transformation. Similarly,
$\beta_i^1 = \beta_{i-1}^1$ when $i = 2j$, and vector $\beta_i^1$ consists of
s thresholds for Y derived from $r^*(X(\alpha_{i-1}^1):Y)$ if $i = 2j + 1$.
The sequence $(\alpha_i^2, \beta_i^2)$ is defined analogously; the difference
is that the procedure starts from discretization of Y. It is convenient to
calculate both pairs $(\alpha_i^1, \beta_i^1)$ and $(\alpha_i^2, \beta_i^2)$
in parallel. The iterations in "path" l stop in step N when
$(\alpha_N^l, \beta_N^l)$ is equal to $(\alpha_i^1, \beta_i^1)$ or
$(\alpha_i^2, \beta_i^2)$ for some $i < N$. Then we select j and l such that
$\rho_j^{*l}$ gains the maximum value and use $(\alpha_j^l, \beta_j^l)$ as the
basis for the final discretization of X and Y. Examples show that this
discretization approximates an optimal one well.

If positive dependence between X and Y is sufficiently regular, only one
iteration is needed since its results are equal to the initial pair
$(\alpha_0, \beta_0)$ of threshold vectors. This happened in two examples
mentioned in the sequel. In the first example (X,Y) is binormal with
$\rho = 0.6$, and both variables are discretized according to the 15
thresholds calculated in Sec. 3 (n = s = 16). The resulting discretized
distribution with
non-uniform marginals is graphically presented in Fig. la; the rectangles in the 
unit square are marked with four shades of grey according to the values of the 
respective grade densities for the pairs of discretized variables: black in the 
interval , dark grey in (1, 4] , light grey in (f ,1] , and white in [0, f ] . 

The second example concerns a data set taken from the files issued in 1992
by the Central Statistical Office of Poland (available on request from the
author). A population of Polish farms in 1992 is taken into account, divided
into subpopulations according to 12 geographical districts (X) and 6
categories of farm magnitude (Y). The data form a table $\{p_{ij}\}$, where
$p_{ij}$ is the fraction of farms in district i and magnitude category j.
Clustering into four macroregions and four levels of magnitude is performed.
The resulting 4x4 table is visualized in Fig. 1b. The clustering helped to
extract macroregions with concisely described profiles of farm magnitude
(cf. Szczesny et al. (1998)).
Figure 1: Graphical presentation of the grade densities in the probability
tables concerning: (a) the binormal discretized distribution with
$\rho = 0.6$ and $i, j = 1, \ldots, 16$; (b) the fractions of farms in
macroregion i and magnitude category j, $i, j = 1, \ldots, 4$.
Abstract: In this note we show how a two-way table can be fitted by an additive
and multiplicative model in a robust way. This allows us to detect interactions
between rows and columns, which can be displayed in a biplot. We show that
the robust biplot we present here reflects the real interaction structure, and
is not predetermined by atypical values.

Keywords: Biplot, Robustness, Two-way table, Alternating regression.



1. Introduction 

Multivariate data are often represented in the form of a two-way table. We will
denote the elements of this table by $y_{ij}$, where $1 \le i \le n$ indicates
the row index, and $1 \le j \le p$ the column index. The classical model for
such a two-way table is the Anova model

$$
y_{ij} = \mu + a_i + b_j + \delta_{ij} \qquad (1)
$$

where $\mu$ is called the overall mean, $a_i$ represents the row effect and
$b_j$ the column effect. The terms $\delta_{ij}$ can either be seen as
residuals or as interaction terms between rows and columns. The above model (1)
is called the additive model. It is however quite possible that the interaction
terms $\delta_{ij}$ still contain some structure, and therefore one can model
them by a factor model

$$
\delta_{ij} = \sum_{l=1}^{k} \sigma_l f_{il} \lambda_{jl} + \varepsilon_{ij} \qquad (2)
$$

To obtain symmetry between rows and columns, we assume that both
$f_{\cdot l}$ and $\lambda_{\cdot l}$ are normed to one, so that $\sigma_l$
needs to be considered as a scaling factor. As usual in factor analysis, the
vectors of loadings $\lambda_{j\cdot}$ and scores $f_{i\cdot}$ are only
determined up to an orthogonal transformation. The number of factors k is
supposed to be chosen in such a way that the residuals $\varepsilon_{ij}$
resemble white noise as closely as possible. Taking k = 2 yields a biplot
representation of the $\delta_{ij}$. One can now present in the same
two-dimensional plot the rows by $(f_{i1}, f_{i2})$ and the columns by
$(\lambda_{j1}, \lambda_{j2})$.

The biplot was originally proposed by Gabriel (1971) and allows us to
investigate the row and column interaction by visual inspection of a
two-dimensional graphical display. More information about biplots can be found
in Gower and Hand (1996).

Combining (1) and (2) yields an additive and multiplicative model for the
two-way table:

$$
y_{ij} = \mu + a_i + b_j + \sum_{l=1}^{k} \sigma_l f_{il} \lambda_{jl} + \varepsilon_{ij}. \qquad (3)
$$

In order to estimate the $(k + 1)(n + p + 1)$ unknown parameters in the above
model from the n x p available data points, one usually performs a singular
value decomposition (Hoaglin et al., 1983, p. 100-118). This classical approach
corresponds essentially to a least squares fit of the model (3). It is however
well known that an LS-based method is very vulnerable in the presence of
outliers. In this paper, we will discuss a robust approach to fit the model
(3).

Robust singular value decomposition for estimation of the parameters in the
pure multiplicative model (2) was already considered by Daigle and Rivest
(1992), who used an approach based on multivariate M-estimators. In the pure
additive model (1) the Median Polish technique, described by Hoaglin et al.
(1983), is quite popular and has been shown to be very robust. Median Polish is
in fact an approximation to the $L_1$-fit of (1), and our proposed approach can
be seen as a generalization of Median Polish to the more complicated model (3).
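
For readers unfamiliar with Median Polish, a minimal Python sketch of the standard row and column median sweeps for the additive model (1) is given below. It is an illustration of the classical technique and not the authors' S-plus program; the function name `median_polish`, the fixed number of sweeps and the example table are assumptions.

```python
from statistics import median

def median_polish(table, sweeps=10):
    """Fit y_ij = overall + row_i + col_j + resid_ij by alternating medians."""
    n, p = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall, row_eff, col_eff = 0.0, [0.0] * n, [0.0] * p
    for _ in range(sweeps):
        # Row sweep: move each row median of the residuals into the row effect.
        for i in range(n):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [r - m for r in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Column sweep: move each column median into the column effect.
        for j in range(p):
            m = median(resid[i][j] for i in range(n))
            col_eff[j] += m
            for i in range(n):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid

# Exactly additive table with one gross outlier in cell (0, 0).
a, b = [-1.0, 0.0, 2.0], [-3.0, -1.0, 0.0, 1.0, 4.0]
table = [[10.0 + ai + bj for bj in b] for ai in a]
table[0][0] += 100.0
overall, row_eff, col_eff, resid = median_polish(table)
```

In this example the outlier ends up entirely in the residual of its own cell, while the overall, row and column effects are recovered unchanged, which is the robustness property referred to in the text.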

A first suggestion is to use the $L_1$-criterion to fit the model. In fact, the
$L_1$ approach was already shown to be very robust for the additive fit of
two-way tables (Hubert, 1997). If we denote by $\theta$ the vector of all
unknown parameters in the model, and by $\hat{y}_{ij}(\theta)$ the
corresponding fit, then this procedure tries to minimize the objective function

$$
\sum_{i=1}^{n} \sum_{j=1}^{p} \left| y_{ij} - \hat{y}_{ij}(\theta) \right|. \qquad (4)
$$

For the estimation of the unknown parameters we use an iterative procedure
known as alternating regressions, which was originally proposed by Wold (1966)
and used in the context of bilinear models by de Falguerolles and Francis
(1992). The idea is very simple: if we take the row index i in the model
equation (3) fixed and consider the parameters $b_j$ and $\lambda_{j\cdot}$ as
known for all j, then we see that a regression with intercept of the i-th row
of the two-way table on the k vectors of loadings yields estimates for $a_i$
and the vector of scores $f_{i\cdot}$. Conversely, if we take j fixed and
suppose that $a_i$ and $f_{i\cdot}$ are known for all i, and regress the j-th
column on the k vectors of scores, then we can update the estimates for $b_j$
and $\lambda_{j\cdot}$. To make things robust, we will of course use a robust
regression method. Working with the criterion (4) results in performing $L_1$
regressions. Unfortunately, $L_1$-regression is sensitive to leverage points.
Therefore we propose a weighted $L_1$-regression, corresponding to minimizing

$$
\sum_{i=1}^{n} \sum_{j=1}^{p} \left| y_{ij} - \hat{y}_{ij}(\theta) \right| w_i(\theta)\, w_j(\theta). \qquad (5)
$$

The weights will downweight outlying vectors of loadings and scores. Since
the true loadings and scores are unobserved, $w_i$ and $w_j$ depend on the
unknown parameters, and will be updated at each iteration step in the
alternating regression procedure.
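
The alternating idea can be illustrated on a deliberately simplified case. The Python sketch below fits a single multiplicative term (k = 1) without additive part and without the leverage weights, so it is not the procedure proposed in the paper. It uses the fact that an L1 regression through the origin, min_c Σ_j |y_j - c x_j|, is solved by a weighted median of the ratios y_j / x_j with weights |x_j|. All function names and the example data are assumptions made for illustration.

```python
import math
from statistics import median

def weighted_median(values, weights):
    """Smallest value at which the cumulated weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half, acc = 0.5 * sum(weights), 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v
    return pairs[-1][0]

def l1_slope(y, x):
    """argmin over c of sum |y_j - c x_j| (L1 regression through the origin)."""
    ratios = [yj / xj for yj, xj in zip(y, x) if xj != 0.0]
    weights = [abs(xj) for xj in x if xj != 0.0]
    return weighted_median(ratios, weights) if ratios else 0.0

def rank_one_l1(Y, sweeps=20):
    """Alternate L1 regressions of rows on loadings and columns on scores."""
    n, p = len(Y), len(Y[0])
    lam = [median(Y[i][j] for i in range(n)) for j in range(p)]  # robust start
    f = [0.0] * n
    for _ in range(sweeps):
        f = [l1_slope(Y[i], lam) for i in range(n)]
        lam = [l1_slope([Y[i][j] for i in range(n)], f) for j in range(p)]
    # Norm scores and loadings to one; sigma carries the scale.
    nf = math.sqrt(sum(v * v for v in f)) or 1.0
    nl = math.sqrt(sum(v * v for v in lam)) or 1.0
    return nf * nl, [v / nf for v in f], [v / nl for v in lam]

# Exact rank-one table with one gross outlier in cell (0, 0).
u, v = [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.5, 2.0, 2.5]
Y = [[ui * vj for vj in v] for ui in u]
Y[0][0] += 50.0
sigma, f, lam = rank_one_l1(Y)
```

In this example the fitted term reproduces the rank-one structure in every cell except the contaminated one, whereas a least squares fit would spread the outlier over the first row and column.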

In Section 2 of the paper we will discuss in more detail the algorithm we used,
while in the last section we illustrate our approach. Moreover, many other
simulations and experiments showed that the proposed procedure is highly robust
and converges quite fast. A documented S-plus program of the implemented
algorithm is freely available at
http://www.statistik.tuwien.ac.at/public/filz/research.html.



2. Algorithm 

We will describe an iterative algorithm for obtaining the model with both
additive plus multiplicative fit. Pure additive or multiplicative fits can be
obtained in the same vein. In order to identify the additive part of the model
we set the restrictions

$$
\mathrm{med}_i(a_i) = \mathrm{med}_j(b_j) = 0 \quad \text{and} \quad
\mathrm{med}_i(f_{il}) = \mathrm{med}_j(\lambda_{jl}) = 0, \qquad (6)
$$

for $l = 1, \ldots, k$. The importance of the l-th term in the factor model is
measured by the magnitude of $\sigma_l$. We will arrange them in descending
order. To identify them we demand that both loadings and scores are
normalized, that is

$$
\sum_i f_{il}^2 = \sum_j \lambda_{jl}^2 = 1 \qquad (7)
$$

for $l = 1, \ldots, k$. Note that there is in principle no orthogonality
restriction between scores $f_{il}$ and loadings $\lambda_{jl}$, and that we
have complete symmetry in the indices i and j. Therefore we may assume without
loss of generality that $n \ge p$.

It is important to take starting values for the estimators which are robust, 
otherwise the procedure will lose its virtues. We set = mcdj{yij) and = 
medj (i — 1, . . . , n) to start up the iteration process. We also need to initialize 
the score matrix f-p (/ = 1, . . . , /c), which can be done by computing scores 
based on a robust principal component analysis of the data, and center and scale 
them afterwards to meet conditions (6) and (7). Croux and Ruiz-Gazen (1996) 
proposed a fast procedure for a robust PCA. 

Suppose now that we are in step t ($\ge 1$) of the algorithm and that
$\mu^{(t-1)}$, $a_i^{(t-1)}$ and $f_{il}^{(t-1)}$ are available. Take column j
fixed and consider the regression model

$$
y_{ij} - \mu^{(t-1)} - a_i^{(t-1)} = b_j + \sum_{l=1}^{k} f_{il}^{(t-1)} \eta_{jl} + \varepsilon_{ij}
\quad (i = 1, \ldots, n), \qquad (8)
$$

with unknown parameters $\eta_{jl} = \sigma_l \lambda_{jl}$ and intercept
$b_j$. For every index i we now compute the corresponding weight $w_i$, which
downweights rows with outlying score vectors. To this end we compute robust
Mahalanobis distances based on the Minimum Volume Ellipsoid (Rousseeuw and van
Zomeren, 1990) for the collection of score vectors
$f_{1\cdot}^{(t-1)}, \ldots, f_{n\cdot}^{(t-1)}$, yielding a sequence of robust
distances $RD_1, \ldots, RD_n$. Every distance $RD_i$ measures how far the
k-dimensional vector $f_{i\cdot}^{(t-1)}$ is from the "center" of the data
cloud formed by all n score vectors. Afterwards we set



Wi = min(l, for z = 1, . . . , n. 

This choice for the weights was suggested by Simpson et al. (1992). The final
regression estimator is then obtained by minimizing

$$\sum_{i=1}^{n} w_i \Bigl| y_{ij} - \hat\mu^{(t-1)} - \hat a_i^{(t-1)} - \Bigl( b_j + \sum_{l=1}^{k} \hat f_{il}^{(t-1)} \eta_{jl} \Bigr) \Bigr|,$$

which can be done by usual $L_1$-regression. As solution we obtain estimates
$\hat b_j^{(t)}$ and $\hat\eta_{jl}^{(t)}$. We repeat this procedure for every
column, which implies that we need to perform an $L_1$-fit $p$ times, and this at
every iteration step. Fortunately, this takes almost no time. Note that the
weights $w_i$ only need to be computed once every iteration step.
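
As an illustration of the downweighting step, the following sketch maps robust
distances to weights in (0, 1]. The form min(1, c / RD_i^2) with a user-supplied
cutoff `c` is an assumption of this sketch and is not taken from the text; the
function name and the example distances are hypothetical, and the computation of
the MVE-based distances themselves is left out.

```python
def row_weights(robust_distances, cutoff):
    """Map robust distances RD_i to weights w_i = min(1, cutoff / RD_i**2).

    The cutoff is an assumed tuning constant supplied by the caller.
    """
    if cutoff <= 0:
        raise ValueError("cutoff must be positive")
    weights = []
    for rd in robust_distances:
        # A row sitting exactly at the center (RD_i = 0) keeps full weight.
        weights.append(1.0 if rd == 0 else min(1.0, cutoff / rd ** 2))
    return weights


# Hypothetical distances: only the last row has an outlying score vector.
example_weights = row_weights([0.5, 1.2, 2.0, 10.0], cutoff=5.0)
```

Rows close to the bulk of the score vectors keep weight one, so the weighted fit
coincides with the unweighted one when no score vector is outlying.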

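The estimates $\hat\eta_{jl}^{(t)}$ still need to be decomposed. This can be done
by writing

$$\hat\eta_{jl}^{(t)} = \hat\sigma_l^{(t)} \hat\lambda_{jl}^{(t)} + \operatorname{med}_j \hat\eta_{jl}^{(t)}, \qquad (9)$$

with

$$\hat\sigma_l^{(t)} = \bigl\| \hat\eta_{\cdot l}^{(t)} - \operatorname{med}_j \hat\eta_{jl}^{(t)} \bigr\| \quad\text{and}\quad \hat\lambda_{jl}^{(t)} = \bigl( \hat\eta_{jl}^{(t)} - \operatorname{med}_j \hat\eta_{jl}^{(t)} \bigr) / \hat\sigma_l^{(t)}.$$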
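
The decomposition in (9) can be sketched in a few lines. The helper below takes
the estimates for one term $l$ across all columns $j$ and returns the scale, the
normalized and median-centered loadings, and the median that moves to the additive
part. The function name is hypothetical and the norm is assumed to be Euclidean.

```python
import math
from statistics import median


def decompose_eta(eta):
    """Split eta_jl (j = 1..p, fixed l) into (sigma_l, lambda_l, med_l)."""
    med = median(eta)
    centered = [e - med for e in eta]
    sigma = math.sqrt(sum(c * c for c in centered))  # Euclidean norm (assumed)
    if sigma == 0:
        raise ValueError("eta is constant; there is no multiplicative part")
    lam = [c / sigma for c in centered]
    return sigma, lam, med


eta = [1.0, 4.0, 2.0, 6.0, 3.0]  # hypothetical estimates for one term
sigma, lam, med = decompose_eta(eta)
rebuilt = [sigma * value + med for value in lam]  # recovers eta, as in (9)
```

By construction the loadings have median zero and unit norm, so they satisfy both
identification restrictions at once.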

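The last term of (9) enters the additive part, so that the row effects are
updated as

$$\hat a_i^{(t)} = \hat a_i^{(t-1)} + \sum_{l=1}^{k} \hat f_{il}^{(t-1)} \operatorname{med}_j \hat\eta_{jl}^{(t)}.$$

We end the first part of iteration step $t$ by updating the overall effect
$\hat\mu^{(t)}$ with the medians $\operatorname{med}_i \hat a_i^{(t)}$ and
$\operatorname{med}_j \hat b_j^{(t)}$ and centering row and column effects around
their median in order to fulfill restrictions (6) and (7).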

In the second part of the $t$th step of this alternating regression algorithm we
consider for each row $i = 1, \ldots, n$ the regression model

$$y_{ij} - \hat\mu^{(t)} - \hat b_j^{(t)} = a_i + \sum_{l=1}^{k} \tilde\eta_{il} \hat\lambda_{jl}^{(t)} + \varepsilon_{ij} \qquad \text{for } j = 1, \ldots, p \qquad (10)$$

with unknown parameters $\tilde\eta_{il} = f_{il}\sigma_l$ and $a_i$, and where
$\hat\mu^{(t)}$, $\hat b_j^{(t)}$ and $\hat\lambda_{jl}^{(t)}$ are obtained from
the first part and are treated as fixed. All estimates for step $t$ are now
obtained and updated in exactly the same way as above. The weights $w_j$ are now
based on the robust distances computed from the loadings vectors. The iterative
algorithm is stopped when there is no essential decrease in the objective
function any more. We have no formal proof for the convergence of the algorithm,
but according to our experience convergence can be expected after 10 to 20
iteration steps.
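
The alternating scheme can be illustrated on a deliberately reduced case. The
sketch below assumes a single multiplicative term ($k = 1$), no additive effects
and all weights equal to one, so it is not the procedure described above but only
its skeleton. In that case each $L_1$ regression has one slope and no intercept,
and its exact solution is a weighted median of ratios. All names and the test
data are hypothetical.

```python
from statistics import median


def weighted_median(pairs):
    """Return a minimizer m of sum(w * |v - m|) over (v, w) pairs."""
    pairs = sorted(pairs)
    half = sum(w for _, w in pairs) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return 0.0  # only reached for an empty list


def l1_slope(y, x):
    """Minimize sum_i |y_i - x_i * b| over b (L1 fit through the origin)."""
    # |y_i - x_i b| = |x_i| * |y_i / x_i - b|: a weighted median of the ratios.
    return weighted_median([(yi / xi, abs(xi)) for yi, xi in zip(y, x) if xi != 0])


def alternating_l1_rank_one(data, n_iter=10):
    """Fit data[i][j] ~ f[i] * lam[j] by alternating L1 regressions."""
    n, p = len(data), len(data[0])
    f = [median(row) for row in data]  # robust start: row medians
    lam = [0.0] * p
    for _ in range(n_iter):
        # Part one: regress every column on the current scores.
        lam = [l1_slope([data[i][j] for i in range(n)], f) for j in range(p)]
        # Part two: regress every row on the current loadings.
        f = [l1_slope(data[i], lam) for i in range(n)]
    return f, lam


true_f = [1.0, 2.0, 3.0, 4.0, 5.0]
true_lam = [1.0, -1.0, 2.0, 0.5]
data = [[fi * lj for lj in true_lam] for fi in true_f]
data[0][0] = 100.0  # one severe outlier
f, lam = alternating_l1_rank_one(data)
residuals = [[data[i][j] - f[i] * lam[j] for j in range(4)] for i in range(5)]
```

In this toy example the fit reproduces the clean cells and leaves the
contaminated cell with a large residual, which is the behaviour the residual
barplots are meant to reveal.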

A graphical display of the interactions between rows and columns is obtained
as an additional output of our program. A constant a for scaling variables and
objects can be chosen. Besides that, there is the option of orthogonalizing the
finally obtained factors and scores. The program also allows the use of other
regression estimators, such as the Least Trimmed Squares estimator. Taking the
non-robust Least Squares estimator corresponds to the classical approach of
Gabriel (1978). Our experience, however, is that the weighted $L_1$-estimator
gives the most satisfying results among the estimators we tried out.



3. Example 

As an example we created an artificial data set of size 10 × 15. We generated the
data according to model (3) with randomly generated coefficients. Afterwards,
four observations were replaced by severe outliers: cells (1,1), (1,2), (4,4), and
(8,8). We fitted an additive and multiplicative model with the described robust
procedure and with the usual LS-approach. As benchmark, we also computed
an LS-fit on the original, clean data. In Figures 1-3 we plotted the
corresponding biplots. The arrows indicate the positions
$(\lambda_{j1}, \lambda_{j2})$ for $j = 1, \ldots, p$ of the columns (also called
variables here), while the rows are indicated at positions $(f_{i1}, f_{i2})$ for
$i = 1, \ldots, n$. We also plotted three-dimensional barplots of the residuals
$\hat\varepsilon_{ij}$, which are useful in detecting outliers. The idea is that
the biplot represents the general interaction structure between rows and columns,
while outliers show up in the barplot.

We clearly see that the robust biplot represents the structure of the good
observations, without being influenced too much by the outliers. It resembles the
usual biplot based on the clean data, showing the robustness of our procedure.
The four outliers are clearly indicated by the barplot.

The classical biplot is completely determined by the four outliers. Information
about the other 146 cells can hardly be obtained from the biplot. The barplot
of the LS-based residuals only retrieves cell (4,4) as an outlier, while some other
cells, like (4,1), have a huge residual while not being true outliers. Finally, the
barplot of the residuals of the LS-fit based on the clean data does not reveal any
outliers, as it should.




Figure 1: Biplot and barplot of the residuals for the robust procedure based on 
the contaminated data 




Figure 2: Biplot and barplot of the residuals for the LS-procedure based on the
contaminated data

Figure 3: Biplot and barplot of the residuals for the LS-procedure based on the
clean data

Abstract: We introduce a graphical factor analysis model as a graphical
Gaussian model with latent variables satisfying a set of conditional 
independence constraints. After a brief introduction of the factor analysis 
model, we generalise the class of such models by allowing the concentration 
matrix of the residuals to have non-zero off-diagonal elements. The study of the 
associations left unexplained by the latent factors allows a better interpretation 
of the model. We concentrate on models with one latent variable, for which the 
identifiability condition is well established. A real data example is then 
presented to clarify the ideas. Two model selection procedures are presented, 
one based on the calculations of deviance differences and the other based on the 
calculation of the posterior probability of the model. Given the analytical 
intractability of the latter, we propose a Markov Chain Monte Carlo method to 
approximate both the model probabilities and the inferences on the quantities of 
interest. 

Key words: Factor analysis models, Graphical Gaussian models, Identifiability,
Markov Chain Monte Carlo methods, Model selection.



1. Model specification

Results concerning latent variables may be usefully applied in the graphical 
modelling context. The role and relevance of latent variables depend upon their 
interpretation. We can distinguish between situations where latent variables 
have a substantive interpretation as hidden variables and situations where they 
are merely a computational artefact. 

More precisely, the first status typically derives from psychometric factor 
analysis contexts, where the aim is to measure underlying concepts with a 
psychological meaning. The second status is less established and essentially 
deals with the idea that the joint distribution of the observed random variables
can be conveniently simplified by introducing some further random variables
which are not observed and typically do not have any ‘a priori’ substantive
meaning. A noticeable example of the latter occurs in exploratory factor 
analysis, where the aim is to find the smallest number of latent variables such 
that the observed variables become independent conditionally on them. 

We consider the factor analysis model:

$$X = \Lambda\xi + \delta \qquad (1)$$

where $X$ is a $p \times 1$ vector of observed variables, $\Lambda$ is a
$p \times q$ matrix of factor loadings, $\xi$ is a $q$-dimensional vector of
latent variables, and $\delta$ is a $p \times 1$ vector of residuals. We assume
$\delta$ has the multivariate Gaussian distribution $N(0, \Theta)$ with $\Theta$
a positive definite symmetric matrix. We also assume $E(\delta\xi') = 0$. The $X$
and $\xi$ variables are written as deviations from their mean, and
$E(\xi\xi') = \Phi$. To avoid a trivial identification problem we set the
elements of the main diagonal of $\Phi$ equal to one. Under these assumptions it
turns out that the covariance matrix of $X$ is:

$$\Sigma = \Lambda\Phi\Lambda' + \Theta. \qquad (2)$$
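
To make the covariance structure in (2) concrete, the following sketch computes
the implied covariance matrix from given loadings, factor covariance and residual
covariance. It uses plain nested lists; the function name and the numerical
values are hypothetical and chosen only so that the implied variances equal one.

```python
def implied_covariance(loadings, phi, theta):
    """Return Sigma = Lambda Phi Lambda' + Theta for nested-list matrices."""
    p, q = len(loadings), len(phi)
    sigma = [[theta[i][j] for j in range(p)] for i in range(p)]
    for i in range(p):
        for j in range(p):
            for a in range(q):
                for b in range(q):
                    sigma[i][j] += loadings[i][a] * phi[a][b] * loadings[j][b]
    return sigma


# Hypothetical single-factor example (q = 1, Phi = [[1]]) in which the residuals
# of the second and third variable remain correlated.
loadings = [[0.9], [0.8], [0.7]]
phi = [[1.0]]
theta = [[0.19, 0.00, 0.00],
         [0.00, 0.36, 0.10],
         [0.00, 0.10, 0.51]]
sigma = implied_covariance(loadings, phi, theta)
```

The off-diagonal residual term adds directly to the covariance between the second
and third variable, on top of the part explained by the factor.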

In the classical factor analysis framework, all the associations among the
observed variables are explained by the unobserved factors, so that $\Theta$ is
assumed to be a diagonal matrix. Here we will see the classical factor analysis
model as a graphical Gaussian model, with the peculiarity that some random
variables are not observed. In this sense, the concentration matrix of residuals
$\Theta^{-1}$ will be allowed to have non-zero off-diagonal elements. The
resulting model is named a graphical factor analysis model.

When latent variables have a substantive meaning, the structure of associations
among the residuals helps in improving the interpretation of the model by
giving information about the substantive correlations unexplained by the
unobserved variables. On the other hand, when latent variables are
computational artefacts, such as in exploratory factor analysis, this procedure
can be seen as an intermediate step between fitting a model with one factor and
a model with more factors. We also believe that this model may be relevant
for highly structured stochastic systems where latent variables may help to elicit
conditional independencies.

A relevant problem in the specification of a graphical factor analysis model is
the identifiability of the parameters of the model. If (2) admits a unique
solution (or at least a finite number of solutions) in $\Lambda$, $\Phi$ and
$\Theta$, then the model is globally identified. Sufficient and/or necessary
conditions concerning identifiability are available in the literature (see
Bollen, 1990, for a review). However, they all concern models where $\Theta$ is
a diagonal matrix. A sufficient condition for identifiability of a single-factor
($q = 1$) model with correlated residuals has been proposed by Stanghellini
(1997). Such a condition can be extended to a generic number of factors, and we
are currently examining this research issue.



2. Model comparison

Let $\{G^1, \ldots, G^m\}$ be a collection of graphical models (see e.g.
Lauritzen, 1996), describing alternative association structures for
$X_V = (X, \xi)$, where $\xi$ are the unobserved random variables. Consider a
random sample of $n$ $p$-variate observations from $P_G$, $X^{(n)} = x^{(n)}$. By
quantitative learning we mean that, having fixed $G$, the evidence is used to
estimate the unknown parameter $\Sigma_G$. On the other hand, structural learning
has the objective of establishing which graphs are better supported by the data
and the prior information available.

Our objective is to perform structural learning. This will be done in two ways: 
1) by means of a deviance-based sequential testing procedure; 2) by comparing 
the posterior probability of the graphs, corresponding to alternative conditional 
independence constraints. 

When only one latent factor is considered, the graphical factor analysis model
introduced in Section 1 specifies that $p(x \mid \xi) = N(\lambda\xi, \Theta_G)$
and $p(\xi) = N(0, 1)$, where $\Theta_G$ is such that the elements of its inverse,
say $\theta^{ij}$, satisfy the pairwise Markov property:

$$\theta^{ij} = 0 \iff i \nsim j \; [G]$$

where the symbol $i \nsim j \; [G]$ means that $i$ and $j$ are not adjacent in the
graph $G$.
Concerning the proposed Bayesian model selection procedure, we need to
derive the posterior probability of each graph. For each graph $G$ such a
probability is defined by Bayes' theorem:
$p(G \mid x^{(n)}) \propto p(x^{(n)} \mid G)\, p(G)$, where the prior probability
$p(G)$ will be taken as uniform. We thus need to calculate the marginal
likelihood $p(x^{(n)} \mid G)$. This requires the specification, for each $G$, of
a prior distribution $p(\Theta \mid G)$ on the $p(p+1)/2$ covariance parameters
characterising the graphical factor model corresponding to $G$. This will be done
following the approach in Giudici and Green (1997), taking $\Theta$ as Hyper
Inverse Wishart with parameters $\alpha = p$ and $\Phi = I$. Finally, as a prior
distribution on the loading coefficients, we shall take $\lambda_i \sim N(l, m)$,
for $i = 1, \ldots, p$ and $(l, m)$ constants to be specified.
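
Once marginal likelihoods are available, the posterior graph probabilities under
a uniform prior follow by normalization. The sketch below does this on the log
scale, which is a common numerical precaution and an assumption of this example
rather than something stated in the text. The function name and the input values
are hypothetical.

```python
import math


def posterior_model_probabilities(log_marginal_likelihoods):
    """Normalize p(x | G) over graphs G under a uniform prior p(G)."""
    top = max(log_marginal_likelihoods)
    # Subtracting the maximum avoids underflow when exponentiating.
    unnormalized = [math.exp(v - top) for v in log_marginal_likelihoods]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical log marginal likelihoods for three candidate graphs.
probs = posterior_model_probabilities(
    [-1050.0, -1050.0 + math.log(2.0), -1060.0]
)
```

Because the prior is uniform it cancels in the normalization, so only differences
between log marginal likelihoods matter.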



3. Computational methods 

We computed the ML estimates via the ECM algorithm (Meng and Rubin,
1994). Details of the implementation of the ECM algorithm for the models we
consider are given in Stanghellini (1995). Each model is tested against the
model with no constraints on $\Sigma$ via the likelihood ratio (deviance) test.

On the other hand, the calculation of the marginal likelihood cannot be
performed analytically, because of the presence of the latent variables, which
need to be integrated out. There are other reasons which suggest considering an
alternative, simulation-based approach. For instance: a) in realistic situations
(such as probabilistic expert systems), the number of random variables
considered is very high and, thus, the number of possible graphs forbids an
exhaustive model search; b) it may be important to derive quantitative
inferences on functions of $\Sigma$, such as the partial correlation coefficients.

We propose, but do not detail here, for lack of space, a Markov Chain Monte
Carlo method to derive approximately both the posterior model probabilities
and quantitative inference on any quantities of interest. The design of such a
sampler is based on a proposal by Green (1995), known as the reversible jump
Markov Chain Monte Carlo method. We extend recent work of Giudici and
Green (1997), by including latent variables in graphical Gaussian models.
Briefly, our algorithm contemplates, at each stage of the Markov Chain, a
systematic scan through three possible moves: a) adding/dropping an edge from
the current graph, yet remaining within the class of decomposable models; b)
updating the variance-covariance matrix $\Theta_G$; c) updating the latent
parameters $\xi$ and $\lambda$. For each of the moves, we simulate a candidate
(proposal) new value of either $G$, $\Theta$ or $(\xi, \lambda)$ and then accept
it with probability determined by the Metropolis-Hastings acceptance ratio (see
e.g. Green, 1995). The algorithm is very fast, due to the simple proposal
distributions adopted and the localisation of all the Metropolis-Hastings ratios
on the cliques of the current graph.
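
The accept/reject step shared by the three moves can be sketched generically.
The code below only shows the acceptance rule min(1, ratio) on the log scale; the
ratio itself (likelihood, prior and proposal terms for the move at hand) is
assumed to be supplied by the caller, and the function names are hypothetical.

```python
import math
import random


def metropolis_hastings_accept(log_ratio, rng):
    """Accept a candidate with probability min(1, exp(log_ratio))."""
    if log_ratio >= 0:
        return True
    return rng.random() < math.exp(log_ratio)


def acceptance_rate(log_ratio, n_draws=10000, seed=0):
    """Empirical acceptance frequency for a fixed log ratio (illustration only)."""
    rng = random.Random(seed)
    hits = sum(metropolis_hastings_accept(log_ratio, rng) for _ in range(n_draws))
    return hits / n_draws
```

A candidate that increases the target is always accepted, while a candidate that
decreases it is accepted with probability equal to the ratio, so the empirical
rate for a fixed ratio below one settles near that ratio.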



4. Hospital recoveries 

We analyse a set of variables concerning functional ability indices (see
McDowell and Newell, 1987, for an introduction to such measures). The
sample consists of 70 people recovering in a hospital. The average age of the
people is 82.71. The interest is in studying the relationships between different
psycho-cognitive attitudes. The measures are from 1 to 7 (1 is total dependence)
and are assigned to each person by the physicians carrying out the study, after a
follow-up period of 6 months. Only the people who completed the study are
included in the sample. The variables concern the people's attitudes towards:
self-care (Y1), comprehension (Y2), expression (Y3), interaction with other
people (Y4), solving problems (Y5) and memory (Y6).

The covariances and partial correlation coefficients are reported in Table 1. We
remark that there is a positive correlation between the age and the measures,
which is possibly due to the process by which the sample is selected. To check
the validity of the linearity assumption of the Y variables we performed the
exploratory battery of tests as suggested by Cox and Wermuth (1994). There is
a small non-linearity involving the Y5 variable, which can be removed by
taking the log of such a variable.

There is an unobserved heterogeneity among the people, depending on the
varying level of severity of their illness. Therefore, we assume that there is an
underlying unobserved variable which takes into account the severity of their
illness. The analysis of the association is then performed after conditioning on
the latent variable. As explained in Section 1, this coincides with fitting a
graphical Gaussian model on the concentration matrix $\Theta^{-1}$ of the
residuals.

We performed our analyses both in the classical (MLE) and Bayesian 
framework. 

Table 1: Marginal covariances (lower triangle) and partial correlations (upper triangle).

Variables      Y1       Y2       Y3       Y4       Y5       Y6
Y1           3.539   -0.060    0.114    0.237    0.283   -0.046
Y2           2.088    3.307    0.719   -0.085   -0.094    0.425
Y3           2.352    3.126    3.463    0.410    0.068   -0.079
Y4           2.618    2.784    3.083    3.761    0.213    0.106
Y5           2.283    2.347    2.477    2.662    2.901    0.551
Y6           2.346    3.017    2.978    2.997    2.867    3.806

We first comment on the results of the former. The graphical factor model
which exhibits the best fit (deviance = 5.89, with 6 degrees of freedom) has
three off-diagonal elements of $\Theta^{-1}$ significantly different from zero.
This model is the result of a forward selection procedure, starting with a
single-factor model with $\Theta$ diagonal (deviance 52.11, with 9 degrees of
freedom, showing that one factor does not explain all the correlations among the
observed variables) and adding at each step the most significant edge. The
sufficient condition for identifiability of the graph (see Stanghellini, 1997) is
satisfied, therefore the selected model is identified.
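
The two deviances quoted above can be referred to a chi-square distribution with
the stated degrees of freedom. The sketch below is a small standard-library
implementation of the chi-square upper tail for integer degrees of freedom, based
on the recurrence of the regularized incomplete gamma function. The function name
is hypothetical, and using the chi-square reference distribution for these
deviances is the usual large-sample assumption.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


p_selected = chi2_sf(5.89, 6)       # selected graphical factor model
p_single_factor = chi2_sf(52.11, 9)  # single-factor model, diagonal residuals
```

The first p-value is roughly 0.44, so the selected model is not rejected, whereas
the second is far below any conventional level, in line with the conclusion that
one factor alone does not explain all the correlations.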

We observe that the $\Theta$ matrix is block-diagonal, showing that the Y1 and
Y4 variables, measuring the self-care and the interaction with other people, are
independent from the other measures, given the latent factor. The other
variables remain correlated also after conditioning on the latent variable, i.e.,
in $\Theta$ there are no more marginal independencies. However, these
correlations are in large part explained via the common associations of these
variables with Y2 and Y5 measuring, in order, the comprehension and the
capability of solving problems of the person. As already noted, the selected
model can be seen as an intermediate step between fitting a single-factor model
and other factor analysis models that involve a higher number of latent
variables. This analysis points out which associations are left unexplained by
the first factor, therefore giving insights about the meaning of a second factor.
We also note that the conditional independence graph of the observed variables
can be derived by marginalising on the latent variables. This marginalisation can
be performed using graphical rules (see Cox and Wermuth, 1996, Ch. 8 for
details).

The Bayesian structural learning, based on 100,000 MCMC
iterations plus 10,000 initial ones which are discarded (burn-in period), shows a
good stability of the simulated results. As a general finding, the models selected
are less parsimonious. This result, which indeed occurs also for the marginal
graphical model without latent variables, mainly depends on the assumed prior
distribution, which particularly favours the independence graph, therefore
conflicting with the available evidence.

The best model receives about 54% of the posterior probability and contains 9
conditional associations of the Y variables, given the unobserved one,
expressed by the corresponding elements of $\Theta^{-1}$ being significantly
different from zero. It is not identified. The second best model, with about 27%
of the posterior probability, is more parsimonious, with fewer elements of
$\Theta^{-1}$ different from zero. It is identified. Both models contain
the one selected via the ML procedure. This second model contains two more
conditional associations on the Y variables than the model selected via ML.
These concern associations between the self-care (Y1) and solving problems
(Y5) capabilities, and between comprehension (Y3) and interaction (Y4)
capabilities. Notice that the two models alone receive about 81% of the
posterior probability, a result which is remarkable having considered a total of
about 26,000 candidate decomposable models.

Abstract: The aim of this study is to extend Nishisato’s (1980) quantification
method of fully ranked data, using Gabriel’s (1971) biplot technique
for multivariate data matrices. Specifically, we propose a row plot, in
which the distance between two plotted points can be interpreted as an
approximation of the (squared) rank distance, in Spearman’s sense, between the
two rankings given by the corresponding judges. Similarly, we also propose
a column plot for objects with a sensible relationship to the row plot for
judges.

Key words: quantification, row plot, column plot, biplot. 



1. Introduction 

Ranked data are widely used in the area of social sciences, for instance
in polls and preference surveys, in which a number of objects (or stimuli)
are evaluated and ranked by a panel of judges (or subjects) according to
their preference. Ranked data have drawn the attention of many researchers such
as Nishisato (1980), Critchlow (1985), Baba (1986) and Diaconis (1988), among
others.

Let $r_{ij}$ denote the rank given to the object $j$ ($= 1, \ldots, p$) by the
judge $i$ ($= 1, \ldots, n$), and write $R = \{r_{ij}\}$. Although this matrix
notation is conventional, we will use another coding scheme for fully ranked data
in this study. The coding scheme is $S = \{s_{ij}\}$ with $n$ rows and $p$
columns, where

$$s_{ij} = r_{ij} - (p+1)/2.$$

The $S$ notation will be useful for later use of matrix decomposition, since the
redundant row averages, $(p+1)/2$, of the raw data matrix $R$ are removed
in $S$. In Spearman’s sense, the squared rank distance between two rows (or
judges) $i$ and $i'$ is defined by

$$d^2(i, i') = \sum_{j=1}^{p} (r_{ij} - r_{i'j})^2 = \sum_{j=1}^{p} (s_{ij} - s_{i'j})^2.$$
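
The coding and the distance can be checked on a small example. The sketch below
centers two rankings by $(p+1)/2$ and computes the squared Spearman distance on
both the raw and the centered ranks; the function names and the two rankings are
hypothetical.

```python
def center_ranks(ranking):
    """Return s_ij = r_ij - (p + 1) / 2 for one judge."""
    p = len(ranking)
    return [r - (p + 1) / 2 for r in ranking]


def squared_spearman_distance(a, b):
    """Sum of squared rank differences between two judges."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


judge_1 = [1, 2, 3, 4]
judge_2 = [4, 3, 2, 1]  # the reversed ranking
d_raw = squared_spearman_distance(judge_1, judge_2)
d_centered = squared_spearman_distance(center_ranks(judge_1), center_ranks(judge_2))
```

Since the same constant is subtracted from every rank, the centered rows sum to
zero and the distance between judges is unchanged.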

2. Quantification of Fully Ranked Data 

For the lower-dimensional reduction of multivariate data, we resort to
Lebart, Morineau, and Warwick (1984; Chapter 1). The rows $s_i$ of
$S$ can be considered as vectors in $\mathbb{R}^p$. Let $v$ be a unit vector in
$\mathbb{R}^p$. The magnitude of the projection of $s_i$ on $v$ is equal to
$s_i'v$. Our objective is to maximize the sum of squared inner products. Then,
the optimization problem can be formulated as

$$\max_v \sum_{i=1}^{n} (s_i'v)^2 \quad \text{subject to } v'v = 1.$$

Adopting the Lagrangian multiplier method, the principal component (PC)
reduction can be derived from the eigensystem

$$(S'S)v = \lambda^2 v, \qquad v \in \mathbb{R}^p.$$

More efficiently, the biplot of fully ranked data can be obtained from the
singular value decomposition (SVD) of the data matrix $S$. Let

$$S = UDV',$$

where $U$ is the $n \times p$ matrix with orthonormal columns,
$V = (v_1, v_2, \ldots, v_p)$ is the $p \times p$ orthogonal matrix, and $D$ is
the $p \times p$ diagonal matrix with singular values
$\lambda_1 \geq \cdots \geq \lambda_{p-1} \geq (\lambda_p = 0)$ as its diagonal
elements. Then two-dimensional row plot points are given by the rows of

$$G_{(2)} = S(v_1 : v_2) = UDV'(v_1 : v_2) = U_{(2)}D_{(2)},$$

where $U_{(2)}$ is the $n \times 2$ submatrix of $U$ and $D_{(2)}$ is the
$2 \times 2$ diagonal submatrix of $D$. Then the squared interpoint distances
between two points in the row plot are approximations of the squared Spearman
rank distances between the corresponding rankings given by judges, with overall
goodness-of-approximation

$$GOA_{row(2)} = 1 - \|SV - (G_{(2)} : 0_{n \times (p-2)})\|^2 / \|SV\|^2.$$



It turns out that

$$GOA_{row(2)} = (\lambda_1^2 + \lambda_2^2) / (\lambda_1^2 + \cdots + \lambda_p^2).$$

For the plot of columns (or objects), we use the same PC axis vector $v$ and a
group of hypothetical supplementary rows (or judges). Specifically, to locate
the first object for instance, consider the $(p-1)!$ rankings

$$(1, 2, 3, \ldots, p),\ (1, 2, p, \ldots, 3),\ (1, 3, 2, \ldots, p),\ \ldots,$$

which give rank 1 to the first object. Then the centroid of these rankings is
given by

$$c_1 = (1, (p+2)/2, (p+2)/2, \ldots, (p+2)/2)'.$$
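
The centroid can be verified by brute force for small $p$. The sketch below
enumerates all $(p-1)!$ rankings that give rank 1 to a chosen object and averages
them; the function name is hypothetical and objects are indexed from zero.

```python
from itertools import permutations


def centroid_with_rank_one(first_object, p):
    """Average all (p-1)! rankings of p objects giving rank 1 to one object."""
    others = [j for j in range(p) if j != first_object]
    total = [0] * p
    count = 0
    for perm in permutations(range(2, p + 1)):
        ranking = [0] * p
        ranking[first_object] = 1
        for obj, rank in zip(others, perm):
            ranking[obj] = rank
        total = [t + r for t, r in zip(total, ranking)]
        count += 1
    return [t / count for t in total]


c1 = centroid_with_rank_one(0, p=4)  # entries: 1 and (p + 2) / 2 = 3
```

Every remaining rank 2, ..., p appears equally often in each free position, which
is why all other coordinates equal their common mean $(p+2)/2$.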

When the size-$p$ vector $c_1$ is projected on the PC axis vector $v$, it is
positioned at $c_1'v$. Similarly, $c_j$ are defined for $j = 2, \ldots, p$ and
their projections on $v$ can be carried out individually. Finally, the $p$
objects can be positioned, say in a column plot, at

$$(c_1'v_1, c_1'v_2),\ (c_2'v_1, c_2'v_2),\ (c_3'v_1, c_3'v_2),\ \ldots,\ (c_p'v_1, c_p'v_2).$$

For the column plot, first note that

$$c_j = \frac{p+2}{2}(1, \ldots, 1)' - \frac{p}{2}(0, \ldots, 1, \ldots, 0)'$$

and that

$$c_j'v = \frac{p+2}{2}(1, \ldots, 1)v - \frac{p}{2}(0, \ldots, 1, \ldots, 0)v = -\frac{p}{2}(0, \ldots, 1, \ldots, 0)v$$

since $S\mathbf{1} = 0$ for $\mathbf{1} = (1, \ldots, 1)'$, by the coding
definition, implies that $U'(UDV'\mathbf{1}) = 0$ or $DV'\mathbf{1} = 0$,
or $v_j'\mathbf{1} = 0$ for $j = 1, \ldots, p-1$. Hence the column points for
objects in the two-dimensional plot are given by the rows of

$$\begin{pmatrix} c_1' \\ \vdots \\ c_p' \end{pmatrix} [v_1 : v_2] = -\frac{p}{2}\, I_p\, [v_1 : v_2] = -\frac{p}{2}\, [v_1 : v_2].$$

Therefore, two-dimensional column plot points can be obtained from the
rows of

$$H_{(2)} = V_{(2)},$$

where $V_{(2)}$ is the $p \times 2$ submatrix of $V$. Then, the
goodness-of-approximation of the two-dimensional column plot is given by

$$GOA_{col(2)} = 1 - \|V_{[p]} - (V_{(2)} : 0_{p \times (p-3)})\|^2 / \|V_{[p]}\|^2 = 2/(p-1),$$

where $V_{[p]}$ is the $p \times (p-1)$ submatrix of $V$. The biplot, produced by
combining row and column plots, can be interpreted as follows. First,
note that $S$ can be expressed as

$$S = (UD)V' \approx (U_{(2)}D_{(2)})V_{(2)}' = G_{(2)}H_{(2)}'.$$

For $s_{ij}$, the $(i,j)$ element of $S$, we can write

$$s_{ij} \approx g_{i(2)}' h_{j(2)},$$

where $g_{i(2)}$ and $h_{j(2)}$ are the row vectors of $G_{(2)}$ and $H_{(2)}$.
Thus the raw data elements are recovered by the inner products of row and column
plot vectors.
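
Both goodness-of-approximation measures reduce to simple ratios, as the following
sketch shows. The function names, the `dims` argument (which generalizes the
two-dimensional case and is an assumption of this sketch) and the example
singular values are hypothetical.

```python
def goa_row(singular_values, dims=2):
    """Share of the squared singular values carried by the leading dims axes."""
    squares = sorted((s * s for s in singular_values), reverse=True)
    return sum(squares[:dims]) / sum(squares)


def goa_col(p, dims=2):
    """Column-plot goodness-of-approximation dims / (p - 1) for p objects."""
    return dims / (p - 1)


example_row = goa_row([3.0, 2.0, 1.0, 0.0])  # hypothetical singular values
example_col = goa_col(10)                    # ten objects
```

Unlike the row measure, the column measure depends only on the number of objects,
so with ten objects it equals 2/9, about 22%, whatever the data.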



3. A Numerical Example 

To illustrate the proposed quantification plots, consider a ranked data set from
Nishisato and Nishisato (1984) in which 31 judges were asked to rank ten
government services (and facilities) according to the order of “being
satisfactory”. The services or objects are:

A = public transit,
B = postal service,
C = medical care,
D = sports/recreational facilities,
E = police protection,
F = public libraries,
G = street cleaning,
H = restaurants,
I = theatres,
J = planning and development,

to which are given the rank 1 for “most satisfactory”, ..., the rank 10 for
“least satisfactory”.

The row plot by quantification is shown in Figure 1 with the
goodness-of-approximation 60%. Also, the column plot by quantification is shown
in Figure 2 with the goodness-of-approximation 22%. By superimposing the row and
column plots, we obtain the biplot of fully ranked data.

In Figure 1, the largest cluster of judges is found in the right-side region along
the first axis; the corresponding position in Figure 2 is occupied by the
object B (postal service). It means that this group of judges agree to put B
as “least satisfactory”. Also, in Figure 1, the second largest cluster of judges
is located in the upper-side region along the second axis. The opposite
position in Figure 2 is occupied by the objects I (theatres) and H (restaurants).
So, we may interpret that the second largest group of judges evaluate
theatres and restaurants very highly.

Figure 1. Row plot for thirty-one judges by quantification
($GOA_{row(2)}$ = 60%)

Figure 2. Column plot for ten government services by quantification
($GOA_{col(2)}$ = 22%)




4. Concluding Remarks 

Nishisato (1980) considered two types of ranked data, the paired comparison type
and the general ranking type, using the typical argument of dual scaling.
However, he did not plot the judges, nor did he consider the geometry
of the ranked data matrix. Taguri, Hiramatsu, Kittaka and Wakimoto (1976)
suggested a graphical representation of the rank correlation coefficient which
shows graphically the correlations between an objective variable and
several explanatory variables and exhibits changes of correlations in part.
They also defined a measure, named the area ratio correlation coefficient,
to represent the degree of correlation based on the graph, which they call
the linked vector pattern. Baba et al. (1984) proposed a linked rank graph
and a rank graph. The former represents the distribution of ranks, while the
latter represents the average ranks and the degree of concordance. Their
motivation comes from the linked vector pattern. The
relations among the three graphs are as follows: the rank graph results
from the drawing of the linked rank graph, and the concordance chart is
obtained by connecting the vectors on the rank graph. These graphs give
us good information about the typical rank given to objects. However, they
did not consider the graph of judges. Baba (1986) also proposed a graphical
method for rank tests, which is based on the rank graph. He showed that, for
large samples, the null hypothesis that ranks are given at random to
each object (or item) can be tested visually by the rank graph with a critical
ellipse. Other graphical methods for ranked data include multidimensional
scaling, minimal spanning trees and nearest neighbor graphs as discussed by
Diaconis (1988). The method we propose is a graphical tool in which
subjects (or judges) as well as objects can be plotted to show their mutual
relationships. One particular merit of the row plot is that Spearman’s distances
between rankings given by judges are approximately preserved. Moreover,
the methodology can be used for segmentation of judges (or observations).
By modifying the coding scheme, we may produce a biplot in which Kendall’s
rank distances instead of Spearman’s are preserved approximately.

Abstract: There are many causes of occurrence of improper solutions in
factor analysis. Identifying potential causes of the improper solutions gives
very useful information on suitability of the model considered for a data set.

This paper studies possible causes of improper solutions in exploratory
factor analysis, focusing upon (A) sampling fluctuations, (B) model
underidentifiable and (C) model unfitted, each having several more detailed items.
We then give a checklist to identify the cause of the improper solution
obtained and suggest a method of reanalysis of the data set for each cause.

Keywords: Covariance Structure Analysis, Checklist, Underidentifiability, 
Sample Fluctuations. 



1. Factor Analysis Model and Improper Solution 

In factor analysis, an observed random $p$-vector $x$ is assumed to have
the following form: $x = \Lambda f + u$, where $\Lambda = (\lambda_{ij})$ is a
$p \times k$ matrix of factor loadings, $f = [F_1, \ldots, F_k]'$ is a $k$-vector
of common factors, $u = [U_1, \ldots, U_p]'$ being a $p$-vector of unique
factors. Here $k$ is the number of factors. Assume further that
$\mathrm{Var}(f) = I_k$, $\mathrm{Cov}(f, u) = 0$,
$\mathrm{Var}(u) = \Psi = \mathrm{diag}(\psi_1, \ldots, \psi_p)$. The covariance
matrix $\Sigma$ ($= (\sigma_{ij})$) of the observed vector $x$ is representable
as $\Sigma = \Lambda\Lambda' + \Psi$. Each diagonal element $\psi_i$ of $\Psi$ is
the variance of $U_i$, so that it should be estimated as a positive value. It is
said to be an improper solution or a Heywood case when some elements
$\hat\psi_i$ are not positively estimated.

2. Cause, Identification and Treatment 

Following van Driel (1978), we distinguish among three types of causes
of improper solutions as in Table 1: (A) sampling fluctuations, (B) model
underidentifiable, and (C) model unfitted. We shall make brief comments on
these causes. Since the parameter space of $\psi_i$ is the finite interval
$(0, \sigma_{ii})$ and estimation methods naturally do not require estimates
$\hat\psi_i$ to be in the interval, $\hat\psi_i$ can be outside the parameter
space or can be at its boundary with positive probability, because of sampling
variations. A typical example is an improper solution which takes place in a
simulation study under a true identified model. Ihara and Okamoto (1985)
compared estimation methods such as ML and LS in terms of the frequency of
improper solutions due to sampling fluctuations. Anderson and Gerbing (1984),
Boomsma (1985) and Gerbing and Anderson (1985) have studied how model
characteristics influence the frequency of occurrence of improper solutions in
the context of confirmatory factor analysis. In my experience, however, improper
solutions due to sampling fluctuations are not often met in practice.
When sampling fluctuations cause an improper solution, it may be useful to
constrain uniqueness estimates $\hat\psi_i$ to be nonnegative. Gerbing and
Anderson (1987) discussed interpretability of constrained estimates for improper
solutions caused by sampling fluctuations in confirmatory factor analysis.

Table 1: Types of causes of improper solutions

A: sampling fluctuation
B: underidentifiability
    B1: only one nonzero element in a column of $\Lambda$
    B2: only two nonzero elements in a column of $\Lambda$
    B3: others
C: factor model unfitted
    C1: some true unique variances < 0
    C2: inconsistent variables $x_i$ included
    C3: others (e.g., minor factors)
D: others (e.g., outliers)

The causes (B) and (C) are important in model inspection. Anderson and Rubin (1956) gave a necessary condition for identification: there must be at least three nonzero elements in each column of $\Lambda$. The cases (B1) and (B2) in Table 1 violate this necessary condition.

In (B1), the parameters $(\lambda_{1k}, \psi_1)$ can take any values as long as they satisfy $\lambda_{1k}^2 + \psi_1 = \sigma_{11} - \sum_{r=1}^{k-1} \lambda_{1r}^2$, so that $\psi_1$ can be negative. The location of the nonzero loading in the $k$-th column is arbitrary. The $k$-th common factor is not a common factor but a unique factor, so the model is actually a $(k-1)$-factor model. Accordingly, an improper solution occurs when the number of factors is overestimated. Why is it overestimated? In many cases a goodness-of-fit test suggests it; that is, the test rejects a $(k-1)$-factor model. There are two potential cases yielding this anomaly (a $(k-1)$-factor model is rejected, yet examination of a $k$-factor solution suggests a $(k-1)$-factor model):

(B11) The sample size $n$ is large enough to reject reasonably well-fitted models; in other words, because of the large sample, the statistical test becomes too sensitive to small deviations from the model, possibly caused by minor factors;

(B12) Distributional assumptions such as normality are violated, so that the distribution of the test statistic cannot be approximated by a chi-square distribution.

Both cases have been pointed out in the context of covariance structure analysis. For (B11), researchers are advised to use goodness-of-fit indices such as GFI, CFI and RMSEA to measure the distance of the population from the model (e.g., Jöreskog and Sörbom 1993, section 4.5.2; Bentler 1995, chapter 5). They would then get useful information concerning the acceptability of the model. The cause (B11) is also interpreted as an effect of minor factors. In (B12), researchers can use elliptical theory or an asymptotically distribution-free (ADF) method. Kano (1990) suggests a noniterative estimation procedure which prevents a unique factor from being reinterpreted as a common factor.



                                                Cause of improper solutions
Checking item                                   A     B1             B2                       C1

(1) Does iteration converge?                    yes   no             no                       yes

(2) Is the solution stable? (not depending      yes   unstable       unstable in one of two   yes
    largely on estimation methods, starting                          particular
    values, or optimization algorithms)                              $\hat\psi_i$'s

(3) Are the SE's† of $\hat\psi_i$ almost        yes   one large SE   one or two large SE      yes
    the same in magnitude?

(4) Does the confidence interval of $\psi_i$,   yes   no             no                       no
    not positively estimated, contain zero?

(5) Are the residual elements of                yes   yes            no for the solution      yes
    $S - (\hat\Lambda\hat\Lambda' + \hat\Psi)$                       reducing k by one
    almost the same in magnitude?

Table 2: Checklist for identifying the cause of improper solutions. See text for examining C2. †SE denotes standard error.



In (B2) an improper solution occurs in the first or second variable only, because the loadings $\lambda_{1k}$ and $\lambda_{2k}$ can take any values as long as they satisfy $\lambda_{1k}\lambda_{2k} = \sigma_{12} - \sum_{r=1}^{k-1} \lambda_{1r}\lambda_{2r}$. One cannot carry out exploratory factor analysis when the population factor loading matrix $\Lambda$ has the form (B2). In that case the researcher has to impose a constraint to remove the indefiniteness, such as $\lambda_{1k} = \lambda_{2k}$ or $\psi_1 = \psi_2$. It is optional whether to impose $\lambda_{ik} = 0$ $(i = 3, \ldots, p)$. See Kano (1997) for details.

When an improper solution occurs due to (B), underidentifiability, it often happens that the iteration does not terminate; the solution depends on starting values, estimation methods (e.g., ML, GLS, LS) or optimization algorithms. In (B1) the location of the single nonzero element, that is, the variable for which $\hat\psi_i < 0$, is arbitrary and also depends on starting values and the like. As noted above, negative estimates can appear only at two particular variables for (B2). These observations distinguish between (B1) and (B2).

For (C1) or (C2), researchers have to remove all variables inconsistent with the model considered. In the case (C1) they can remove the variables with negative unique variances. In (C2) we could examine residuals to identify inconsistent variables. A more sophisticated approach is the likelihood ratio test developed by Kano and Ihara (1994).

The case (D) contains all causes other than (A)-(C). Outliers in samples may cause improper solutions, as pointed out by Bollen (1987), and the existence of outliers is classified in the case (D). The other cases in (D) are as yet unknown and still need to be studied.

In Table 2, we summarize as a checklist how to identify the cause of 
improper solutions. Table 3 presents the method of reanalysis for each cause 
of improper solutions. 



Cause    Treatment

A        Obtain a boundary solution with all $\hat\psi_i \ge 0$
B11      Refer to goodness of fit indices such as GFI and CFI
B12      Apply an ADF type of estimation method
B2       Estimate under a constraint such as $\lambda_{1k} = \lambda_{2k}$ or $\psi_1 = \psi_2$
C1       Remove the variable $x_i$ with $\hat\psi_i < 0$
C2       Remove inconsistent variables

Table 3: Treatment after identifying the cause of improper solutions
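
As a rough illustration of how the checklist of Table 2 and the treatments of Table 3 fit together, the following Python sketch maps three of the five checks to a candidate cause. The function name `diagnose`, its argument names, the `TREATMENT` dictionary and the decision to use only items (1), (4) and (5) are assumptions made for this example; they are not part of the original procedure, and a real diagnosis would weigh all five items.

```python
# Hypothetical helper: a reduced reading of the checklist in Table 2.
# Names and the choice of three checks are illustrative assumptions.

TREATMENT = {
    "A": "obtain a boundary solution with nonnegative uniqueness estimates",
    "B1": "refer to fit indices (B11) or apply an ADF-type method (B12)",
    "B2": "estimate under an equality constraint on the two loadings "
          "or on the two unique variances",
    "C1": "remove the variable with the negative unique variance",
}


def diagnose(converges: bool, ci_contains_zero: bool, residuals_even: bool) -> str:
    """Return a candidate cause (A, B1, B2 or C1) from three checks.

    converges        -- item (1): does the iteration converge?
    ci_contains_zero -- item (4): does the confidence interval of the
                        non-positive uniqueness estimate contain zero?
    residuals_even   -- item (5): are the residuals almost the same in
                        magnitude?
    """
    if converges:
        # Columns A and C1 both converge; item (4) separates them.
        return "A" if ci_contains_zero else "C1"
    # Columns B1 and B2 both fail to converge; item (5) separates them.
    return "B1" if residuals_even else "B2"


cause = diagnose(converges=False, ci_contains_zero=False, residuals_even=True)
print(cause, "->", TREATMENT[cause])
```

The sketch deliberately ignores items (2) and (3), which in Table 2 carry the same information as item (1) for separating the B columns from A and C1.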



3. Example 

Maxwell (1961) conducted maximum likelihood factor analysis of each of two samples, one of 810 normal children and one of 148 neurotic children attending a psychiatric clinic, where the first five items are cognitive tests and the other five are inventories for assessing orectic tendencies (see Table 7 for the items). He found that the iterative process for obtaining the MLE does not terminate and the communality of the eighth variable approaches one for the 4-factor model in the normal sample, whereas a 3-factor model successfully analyzes the sample of neurotic children. He concluded that the method of maximum likelihood along with goodness-of-fit testing does not always perform well. It later turned out that the 4-factor solution for the normal sample is improper (e.g., Jöreskog 1967).

Here we take the sample of normal children to illustrate our procedure for improper solutions. For this sample, Sato (1987), among others, reported that the improper solution depends on the initial estimates for the iteration and gave three different improper solutions with $\hat\psi_6$, $\hat\psi_8$ or $\hat\psi_9$ negative, respectively, and that it is difficult to achieve convergence when the uniqueness estimates are not constrained to be nonnegative. Table 4 shows the MLE of the uniquenesses and their standard errors for each of the cases $0 \le \psi_i < \infty$ and $-\infty < \psi_i < \infty$. The analysis was made with the covariance structure program EQS, developed by Bentler (1995).





                                      goodness-of-fit
                                  $\chi^2_{11}$-value  P-value    $\psi_1$  $\psi_2$  $\psi_3$  $\psi_4$
$0 \le \psi_i < \infty$       MLE     18.447           0.072      384       621       302       638
                              SE                                  037       059       039       037
$-\infty < \psi_i < \infty$   MLE     14.589           0.202      397       629       298       644
                              SE                                  037       056       040       037

                                      $\psi_5$  $\psi_6$  $\psi_7$  $\psi_8$   $\psi_9$  $\psi_{10}$
$0 \le \psi_i < \infty$       MLE     352       778       287       000        690       599
                              SE      096       040       052       003        037       050
$-\infty < \psi_i < \infty$   MLE     324       802       275       -276614    725       609
                              SE      108       042       046       156        040       044

Table 4: Uniqueness estimates (MLE) and their standard errors (SE) in Maxwell's data (n = 810; k = 4). Values are multiplied by 1000.



Obviously, the cause of the improper solution is not "A: sampling fluctuations." We should therefore consider "B: underidentifiability" as a possible cause. Table 5 lists the top five (standardized) residuals, in absolute value, of the 3-factor solution. There is no salient residual in the list, which implies that (B2) is not the cause. To conclude that the cause is (B1), we have to examine (B11) and (B12); unfortunately, we cannot check normality of the observations because the raw data are not available. We would say that the sample size n = 810 is so large that the power of the goodness-of-fit test is raised too much. In fact, other fit indices indicate a reasonable fit of the 3-factor model, for instance GFI = 0.981 and CFI = 0.973. In conclusion, a probable cause of the improper solution is (B1), and the analysis here suggests a 3-factor model for the sample of normal children, as well as for the sample of neurotic children.

    X9-X8     X8-X6     X10-X8    X8-X4     X9-X6
    0.095     0.084     -0.050    0.046     -0.044

Table 5: Top five residuals in the 3-factor solution

Any model is nothing but an approximation to reality, and some deviation of a model from reality always exists. A statistical test can detect the deviation even when it is very small, provided that the sample size is large enough. There are two ways to respond: (i) one employs the slightly misspecified model, regarding the deviation as mere error and hence negligible; (ii) one rejects the model and looks for a suitable treatment to reduce the deviation.

The treatment above for the improper solution of the normal sample is 
based on the consideration (i). There is an alternative story based on (ii). 
The cause of the improper solution is then (C) in this story. The key can 
be found in the list of the residuals in Table 5. 

In Table 5, we can find that the top four residuals are related to the 
eighth variable. This indicates the possibility that the eighth variable be 
inconsistent with the model under consideration. The inconsistency could be 
a cause of the rejection of the 3-factor model. Table 6 shows the goodness- 
of-fit chi-square test statistics of 10 models, each of which is formed by 
removing one of 10 variables. The only accepted model is the one in which 
the eighth variable is removed. As a result, the eighth variable can be 
regarded as inconsistent with the 3-factor model. See Kano and Ihara (1994) 
for details. 





Variable deleted    1       2       3       4       5       6       7       8       9       10
LRT                 35.32   37.38   48.28   65.37   21.63   55.89   40.73   14.58   45.68   38.47

Table 6: $\chi^2_{12}$ of 10 models after deletion of one variable. ($\chi^2_{12}(.10) = 18.55$)
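
The comparison behind Table 6 can be checked with a few lines of Python. This is only a sketch: the helper name `efa_df` is hypothetical, and the degrees-of-freedom formula $((p-k)^2 - (p+k))/2$ is the standard one for an unrestricted k-factor model with p variables, which is an assumption brought in here rather than something stated in the text. The critical value 18.55 and the LRT statistics are copied from Table 6.

```python
def efa_df(p: int, k: int) -> int:
    """Degrees of freedom of the likelihood ratio test for a k-factor
    exploratory model with p observed variables (standard formula)."""
    return ((p - k) ** 2 - (p + k)) // 2


# LRT statistics from Table 6, keyed by the deleted variable.
lrt = {1: 35.32, 2: 37.38, 3: 48.28, 4: 65.37, 5: 21.63,
       6: 55.89, 7: 40.73, 8: 14.58, 9: 45.68, 10: 38.47}

CRITICAL_10_PERCENT = 18.55  # upper 10% point of chi-square with 12 df (Table 6)

# A model is accepted when its statistic stays below the critical value.
accepted = [var for var, stat in lrt.items() if stat < CRITICAL_10_PERCENT]

print(efa_df(10, 4))  # 4-factor model, 10 variables (Table 4)
print(efa_df(9, 3))   # 3-factor model, 9 variables (Table 6)
print(accepted)
```

Under these assumptions the only deletion that brings the statistic below the critical value is that of the eighth variable, in agreement with the text.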



Table 7 shows the MLE in the 3-factor model after deletion of the eighth variable.

                                     Factor 1   Factor 2   Factor 3  |  Communal.
COGNITIVE TESTS
X1   Verbal Ability                  574        412        323       |  603
X2   Spatial Ability                 095        585        139       |  371
X3   Reasoning                       406        697        225       |  702
X4   Numerical Ability               325        487        117       |  356
X5   Verbal Fluency                  780        243        092       |  675
ORECTIC TENDENCIES
X6   Neuroticism Questionnaire       104        175        396       |  198
X7   Way to be different             228        143        808       |  725
X8   Worries and Anxiety             -          -          -         |  -
X9   Interests                       103        180        482       |  275
X10  Annoyances                      000        028        625       |  391

Table 7: 3-factor solution, rotated by normalized VARIMAX, of 9 variables after deletion of the 8th variable. $\chi^2_{12} = 14.58$ (n = 810), P-value = .2712. Estimates are multiplied by 1000.
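
As a quick consistency check on Table 7, the communality of each variable should equal the sum of its squared loadings, since an orthogonal rotation such as VARIMAX leaves communalities unchanged. The sketch below recomputes them from the rounded loadings; the dictionary layout and the function name `communality` are assumptions made for illustration.

```python
# Rotated loadings (in thousandths) and reported communalities from Table 7.
table7 = {
    "X1": ((574, 412, 323), 603),
    "X2": ((95, 585, 139), 371),
    "X3": ((406, 697, 225), 702),
    "X4": ((325, 487, 117), 356),
    "X5": ((780, 243, 92), 675),
    "X6": ((104, 175, 396), 198),
    "X7": ((228, 143, 808), 725),
    "X9": ((103, 180, 482), 275),
    "X10": ((0, 28, 625), 391),
}


def communality(loadings):
    """Sum of squared loadings, with loadings given in thousandths."""
    return sum((value / 1000) ** 2 for value in loadings)


# Absolute difference, in thousandths, between recomputed and reported values.
diffs = {name: abs(1000 * communality(loadings) - reported)
         for name, (loadings, reported) in table7.items()}

for name, diff in diffs.items():
    print(f"{name}: difference {diff:.2f}")
```

Small discrepancies are expected because the published loadings are rounded to three decimals.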



We have suggested two possible treatments of the improper solution in Maxwell's data. Which approach is to be taken, in other words, whether or not the deviation of the 3-factor model using 10 variables can be considered small enough, may depend on the researcher and also on the interpretability of the two resulting analyses.

4. Remarks 

Users cannot implement the procedure described above if they use only the usual exploratory factor analysis (EFA) programs. It is necessary to use programs with which the user can (i) run the analysis with no constraint on $\psi_i$, that is, $\hat\psi_i$ can take negative values; (ii) get standard errors of the estimates, particularly of $\hat\psi_i$; (iii) specify starting values; and (iv) get (standardized) residuals, $S - (\hat\Lambda\hat\Lambda' + \hat\Psi)$. For this, covariance structure analysis (CSA) programs (e.g., Amos, EQS, LISREL) are very useful, although they do not have the option of factor rotation. Researchers are recommended to use both CSA and EFA programs to carry out exploratory factor analysis properly.



Abstract : We start by recalling the basic clustering algorithm for symbolic clustering using the hierarchical and pyramidal models. This algorithm clusters a set of assertion objects and provides a clustering structure where each class is represented by an assertion object generalising all its members. We then show how to extend this algorithm so as to take into account probability distributions on discrete variables. This extension is made by suitably adapting the generalisation step and the generality degree used by the algorithm.

Key-Words : Symbolic Data Analysis, Hierarchical Clustering, Pyramidal 
Clustering, Conceptual Clustering, Probabilistic Data 



1. The Clustering Models 

The proposed methods use the hierarchical and the pyramidal structures as clustering models. Let $\Omega$ be the (finite) set of objects being clustered.

A hierarchy on $\Omega$ is a family $H$ of non-empty subsets of $\Omega$ such that :

$\forall a \in \Omega$, $\{a\} \in H$ ; $\Omega \in H$

$\forall h, h' \in H$, $h \cap h' = \emptyset$ or $h \subseteq h'$ or $h' \subseteq h$

A hierarchy is hence a set of nested partitions of $\Omega$.

The pyramidal model (Diday (1984, 1986), Bertrand (1986), Bertrand, Diday (1985)) generalises hierarchies by allowing non-disjoint classes at each level.

A pyramid is defined as a family $P$ of non-empty subsets of $\Omega$ such that :

$\forall a \in \Omega$, $\{a\} \in P$ ; $\Omega \in P$

$\forall p, p' \in P$, $p \cap p' = \emptyset$ or $p \cap p' \in P$

there exists an order $\theta$ such that each element of $P$ is an interval of $\theta$



A pyramid is a family of nested overlapping classes on $\Omega$, subject to the condition of a total linear order. Hence, a pyramid provides both a clustering and a seriation of the data.

Pyramids can be considered as being halfway between hierarchies and lattices : the pyramidal model, leading to a system of clusters which is richer than that produced by hierarchical clustering, allows for the identification of concepts that the hierarchical model would not identify ; the existence of a compatible order on the objects leads, however, to a structure which is much simpler than lattices (absence of crossing in the graphical representation) and hence much easier to interpret.
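
The two definitions above can be turned into small membership checks. The following Python sketch is only an illustration: the function names `is_hierarchy` and `is_pyramid` are hypothetical, clusters are represented as Python sets, and the total order is passed explicitly as a list rather than searched for.

```python
from itertools import combinations


def _base_ok(omega, fam):
    """Shared conditions: no empty class, all singletons and omega present."""
    if frozenset() in fam or omega not in fam:
        return False
    return all(frozenset([a]) in fam for a in omega)


def is_hierarchy(omega, family):
    """Any two classes are disjoint or nested."""
    omega = frozenset(omega)
    fam = {frozenset(c) for c in family}
    if not _base_ok(omega, fam):
        return False
    return all(not (h & g) or h <= g or g <= h
               for h, g in combinations(fam, 2))


def is_pyramid(omega, family, order):
    """Intersections are empty or classes, and every class is an
    interval of the given total order (a list of the elements)."""
    omega = frozenset(omega)
    fam = {frozenset(c) for c in family}
    if not _base_ok(omega, fam):
        return False
    position = {a: i for i, a in enumerate(order)}

    def is_interval(cluster):
        idx = sorted(position[a] for a in cluster)
        return idx[-1] - idx[0] + 1 == len(idx)

    closed = all(not (p & q) or (p & q) in fam
                 for p, q in combinations(fam, 2))
    return closed and all(is_interval(c) for c in fam)


omega = {"a", "b", "c"}
nested = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "b", "c"}]
overlapping = nested + [{"b", "c"}]

print(is_hierarchy(omega, nested))                      # nested classes only
print(is_hierarchy(omega, overlapping))                 # {a,b} and {b,c} overlap
print(is_pyramid(omega, overlapping, ["a", "b", "c"]))  # intervals of a < b < c
```

The second family is not a hierarchy because {a, b} and {b, c} overlap without being nested, but it is a pyramid for the order a, b, c, which illustrates how pyramids allow overlapping classes.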



2. Symbolic Hierarchical-Pyramidal Clustering
The Basic Algorithm

An algorithm for symbolic hierarchical-pyramidal clustering has been proposed (Brito (1991, 1994)). This algorithm clusters a set of symbolic assertion objects. The criterion that guides the formation of clusters is the intension-extension duality : a cluster is formed if it can be represented by a complete assertion object (conjunctive description) whose extension is the cluster itself. Recall that an assertion is complete if it describes its extension exhaustively, and if it is the least general assertion to fulfil this condition.

The proposed algorithm proceeds in a bottom-up way :

Starting with the one-object clusters $\{a_i\}$, $i = 1, \ldots, n$, at each step form a cluster $p$, union of $p_1, \ldots, p_j$ and represented by $s$, such that

a) $p_1, \ldots, p_j$ can be merged together :

- if the structure is a hierarchy : none of them has been aggregated before ;

- if the structure is a pyramid : none of them has been aggregated twice, and $p$ is an interval of a total order $\theta$ on $\Omega$

b) $s$ is complete

c) Generalisation step : $s$ is more general than $s_1, \ldots, s_j$ => choose $s = s_1 \cup \ldots \cup s_j$

d) $\mathrm{ext}_\Omega\, s = p$

Recall that if

$s_1 = [y_1 = V_1] \wedge \ldots \wedge [y_p = V_p]$ and $s_2 = [y_1 = W_1] \wedge \ldots \wedge [y_p = W_p]$ then
$s_1 \cup s_2 = [y_1 = V_1 \cup W_1] \wedge \ldots \wedge [y_p = V_p \cup W_p]$.

The non-uniqueness of the clusters meeting these conditions leads to the definition of a numerical criterion, called the "generality degree", which measures the part of the description space covered by an assertion :

Let $a = \bigwedge_{i=1}^{p} [y_i = V_i]$, $V_i \subseteq O_i$, and assume that $O_i$ is bounded, $1 \le i \le p$ (this is not a major restriction, since the set $\Omega$ to be clustered is always finite). The generality degree of $a$ is defined as :

$$G(a) = \prod_{i=1}^{p} \frac{c(V_i)}{c(O_i)} = \prod_{i=1}^{p} G(e_i) \quad \text{if } a = \bigwedge_{i=1}^{p} e_i$$

where $c$ stands for the cardinal if $O_i$ is discrete and for the interval length if $O_i$ is continuous. Notice that $G(a \cup b)$ is NOT a dissimilarity measure between $a$ and $b$.

The cluster to be formed should meet conditions a)-d), and minimise $G(s)$.
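
The generalisation step and the generality degree can be sketched in Python for the discrete case. This is an illustration under stated assumptions: assertions are stored as dictionaries mapping a variable name to a set of values, `join` and `generality` are hypothetical names, and the two-variable domain below is made-up example data. Continuous variables (interval length instead of cardinal) are left out.

```python
from fractions import Fraction


def join(s1, s2):
    """Generalisation of two assertions: variable-wise union of value sets."""
    return {y: s1[y] | s2[y] for y in s1}


def generality(s, domains):
    """Generality degree: product over variables of c(V_i) / c(O_i),
    where c is the cardinal (discrete variables only in this sketch)."""
    g = Fraction(1)
    for y, values in s.items():
        g *= Fraction(len(values), len(domains[y]))
    return g


# Made-up description space with two discrete variables.
domains = {"colour": {"blue", "green", "violet"}, "size": {1, 2, 3, 4}}

s1 = {"colour": {"blue"}, "size": {1, 2}}
s2 = {"colour": {"green"}, "size": {2}}
s = join(s1, s2)

print(generality(s1, domains), generality(s2, domains), generality(s, domains))
```

Among the candidate merges that satisfy conditions a)-d), the algorithm would retain the one whose joined assertion has the smallest generality degree; the sketch shows only the computation of that degree, not the search.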

An indexed hierarchy is a couple $(H, f)$ where $H$ is a hierarchy and $f$ a mapping, $f : H \to \mathbb{R}^+$, such that :

i) $f(h) = 0 \iff \mathrm{card}(h) = 1$ ;

ii) $\forall p, q \in H$, $p \subset q \Rightarrow f(p) \le f(q)$

As in the case of hierarchies, a height is associated with each cluster of a pyramid, defining an indexed pyramid :

Let $f$ be a mapping defined on a pyramid $P$ and taking values in $\mathbb{R}^+$. $(P, f)$ is said to be a weakly indexed pyramid if

i) $f(p) = 0 \iff \mathrm{card}(p) = 1$ ;

ii) $\forall p, q \in P$, $p \subset q \Rightarrow f(p) \le f(q)$

iii) $\forall p, q \in P$, $p \subset q$ and $f(p) = f(q) \Rightarrow \exists\, p_1, p_2 \in P$ : $p_1 \ne p$, $p_2 \ne p$, $p_1 \cap p_2 = p$

We index the hierarchy or pyramid by $G(s)$, that is, if $s$ is the assertion object associated with cluster $p$, then the height of $p$ is $G(s)$. $G$ is strictly increasing, so it meets conditions ii) and iii). Condition i) is not met, since $G(s) > 0$ $\forall s$.

The algorithm comes to an end when either

- a cluster corresponding to the whole set $\Omega$ is formed, or

- no cluster meeting conditions a) and d) can be formed ; in this case we obtain an incomplete hierarchy / pyramid.



3. Symbolic Clustering of Probabilistic Data 

The aim is to extend the method so as to take into account probability / frequency distributions on discrete variables, that is, so that it can cluster assertions comprising probabilistic elementary events of the form :

$[y_i = \{m_1(p_1), \ldots, m_k(p_k)\}]$

where $\{m_1, \ldots, m_k\}$ is the set of categories of variable $y_i$, $0 \le p_j \le 1$ and $p_1 + \ldots + p_k = 1$.

This type of symbolic object arises in describing individuals for which variable $y_i$ is a random variable. If the corresponding probability distribution is known, then $p_j = \mathrm{Prob}(y_i = m_j)$, $j = 1, \ldots, k$. Otherwise, these probabilities are estimated by sample frequencies (which implies the repeated observation of variable $y_i$ on each individual). As an example, we can consider the clustering of entities for which repeated measurements of some physical or chemical parameters (variables) have been made.

This extension requires adapting two steps of the algorithm :

- the generalisation step ;

- the computation of the Generality Degree.

Generalisation of two probabilistic elementary events can be done in two ways, each with a particular semantics :

a) Generalisation by the Supremum :

In this case, generalisation of two elementary events is done by taking, for each category $m_j$, the supremum of its probabilities :

$[y_i = \{m_1(p_1^1), \ldots, m_k(p_k^1)\}] \cup [y_i = \{m_1(p_1^2), \ldots, m_k(p_k^2)\}] = [y_i = \{m_1(p_1), \ldots, m_k(p_k)\}]$ where $p_j = \max\{p_j^1, p_j^2\}$, $j = 1, \ldots, k$.

Notice that the generalised elementary event is no longer probabilistic, in the sense that the obtained distribution is no longer a probability (or frequency) distribution : $p_1 + \ldots + p_k \ge 1$.

The extension of such an elementary event $e$, as concerns variable $y_i$, is the set of probabilistic assertions $a$ defined by

$\mathrm{ext}\, e = \{a \,/\, p_j^a \le p_j,\ j = 1, \ldots, k\}$

and consists of probabilistic assertions such that, for each category of variable $y_i$, the probability (frequency) of its presence is at most $p_j$.

The Generality Degree of an elementary event $e$ is then defined as :

$$G_1(e) = \frac{1}{k} \sum_{j=1}^{k} p_j$$

b) Generalisation by the Infimum :

In this case, generalisation of two elementary events is done by taking, for each category $m_j$, the infimum of its probabilities :

$[y_i = \{m_1(p_1^1), \ldots, m_k(p_k^1)\}] \cup [y_i = \{m_1(p_1^2), \ldots, m_k(p_k^2)\}] = [y_i = \{m_1(p_1), \ldots, m_k(p_k)\}]$ where $p_j = \min\{p_j^1, p_j^2\}$, $j = 1, \ldots, k$.

The generalised elementary event is no longer probabilistic : in this case, $p_1 + \ldots + p_k \le 1$.

The extension of such an elementary event $e$, as concerns variable $y_i$, is the set of probabilistic assertions $a$ defined by

$\mathrm{ext}\, e = \{a \,/\, p_j^a \ge p_j,\ j = 1, \ldots, k\}$

and consists of probabilistic assertions such that, for each category of variable $y_i$, the probability (frequency) of its presence is at least $p_j$.

The Generality Degree of $e$ is then defined as :

$$G_2(e) = \frac{1}{k} \sum_{j=1}^{k} (1 - p_j)$$

In each case, $\mathrm{ext}\, e \supseteq \mathrm{ext}\, e_1$ and $\mathrm{ext}\, e \supseteq \mathrm{ext}\, e_2$, that is, $e = e_1 \cup e_2$ is more general than both $e_1$ and $e_2$. Also, $G(e) \ge G(e_1)$ and $G(e) \ge G(e_2)$, that is, the generality degree is increasing with respect to generalisation.

Example :

Let $y_i$ be the variable "colour", and $O_i$ = {blue, green, violet}.

Let $e_1$ = [colour = {blue (0.4), green (0.4), violet (0.2)}] and
$e_2$ = [colour = {blue (0.5), green (0.5)}].

$G_1(e_1) = G_1(e_2) = 1/3$ ; $G_2(e_1) = G_2(e_2) = 2/3$

Supremum case :

$e = e_1 \cup e_2$ = [colour = {blue (0.5), green (0.5), violet (0.2)}] ;

$p_1 + p_2 + p_3 = 1.2 > 1$ ; $G_1(e) = 1.2 / 3 = 0.4$

Infimum case :

$e = e_1 \cup e_2$ = [colour = {blue (0.4), green (0.4)}]

$p_1 + p_2 + p_3 = 0.8 < 1$ ; $G_2(e) = 2.2 / 3$
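
The colour example can be reproduced with a short Python sketch. The function names (`full`, `generalise`, `g_sup`, `g_inf`) are hypothetical, and the two generality degrees are implemented as the mean of the $p_j$ (supremum semantics) and the mean of the $1 - p_j$ (infimum semantics), which is the reading consistent with the values 1/3, 2/3, 1.2/3 and 2.2/3 quoted in the example. Exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction as F

CATEGORIES = ("blue", "green", "violet")


def full(e):
    """Complete an elementary event with probability 0 for absent categories."""
    return {m: e.get(m, F(0)) for m in CATEGORIES}


def generalise(e1, e2, pick):
    """Category-wise generalisation; pick is max (supremum) or min (infimum)."""
    a, b = full(e1), full(e2)
    return {m: pick(a[m], b[m]) for m in CATEGORIES}


def g_sup(e):
    """Generality degree under the supremum semantics: mean of the p_j."""
    return sum(full(e).values()) / len(CATEGORIES)


def g_inf(e):
    """Generality degree under the infimum semantics: mean of the 1 - p_j."""
    return sum(1 - p for p in full(e).values()) / len(CATEGORIES)


e1 = {"blue": F(2, 5), "green": F(2, 5), "violet": F(1, 5)}
e2 = {"blue": F(1, 2), "green": F(1, 2)}

e_sup = generalise(e1, e2, max)
e_inf = generalise(e1, e2, min)

print(g_sup(e1), g_sup(e2), g_sup(e_sup))   # 1/3 1/3 2/5
print(g_inf(e1), g_inf(e2), g_inf(e_inf))   # 2/3 2/3 11/15
```

Here 2/5 equals 1.2/3 and 11/15 equals 2.2/3, so both generalised events are more general than their arguments under the corresponding semantics.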
Conclusion

In this paper we have proposed a symbolic clustering method for probabilistic 
(frequency) data. The method is an extension of the previously developed 
symbolic clustering method, using the hierarchical or the pyramidal models as 
clustering structure. Two different semantics have been considered in 
generalising probability distributions. The obtained clusters are represented by 
symbolic objects which allow for an interpretation, in the considered semantics. 
Abstract: In this paper we make a synthesis between the Ichino and Yaguchi (1994) and Moore (1979) metrics to obtain a new logical proximity function between Boolean symbolic objects. Then, we use histograms defined on these objects to obtain a statistical one. For both logical and statistical proximity functions, we study their properties and present examples to illustrate the usefulness of our approach.

Keywords: Symbolic Objects, Proximity Functions, Histograms. 



1. Introduction 

In conventional exploratory data analysis each variable takes a single value. Nowadays statistical databases are able to store more general data, ranging from a single value to an interval or a set of values. Such data are called Boolean symbolic objects. Extending the methods of usual data analysis to these objects is the main aim of Symbolic Data Analysis (Diday (1991)).

Proximity functions play a major role in exploratory data analysis. For example, clustering and ordination may take a matrix of proximities as input data. Proximity functions between Boolean symbolic objects can be found in Gowda and Diday (1991), Ichino and Yaguchi (1994), De Carvalho (1994) and Gowda and Ravi (1995).

Marcotorchino (1991) makes a distinction between logical and statistical proximity functions. Logical proximity functions are based on local information (i.e., information given only by the descriptions of the pair of objects considered), while statistical ones are also based on statistical information concerning the whole set of objects.

In this paper we propose new logical proximity functions between Boolean symbolic objects which are based on symbolic operators (join and conjunction) and on a positive measure (description potential) defined on these objects. Then, we combine histograms defined on these objects (De Carvalho (1995)) with the logical proximity functions to obtain statistical proximity functions. For both logical and statistical proximity functions, we study their properties and present examples to illustrate the usefulness of our approach.
2. Symbolic objects

Let $\Omega$ be the set of individuals and $\omega \in \Omega$ an individual. A variable is a function $y_i : \Omega \to O_i$, where $O_i$ is the set of values that $y_i$ may take. A variable may take no values, a single value or several values (a discrete set or an interval) for a symbolic object. Let $\{y_1, \ldots, y_p\}$ be a set of $p$ variables.

A Boolean symbolic object (Diday (1991)), denoted $a = [y_1 \in A_1] \wedge \ldots \wedge [y_p \in A_p]$, where $A_i \subseteq O_i$, $\forall i \in \{1, \ldots, p\}$, expresses the condition "Variable $y_1$ takes its values in $A_1$ and ... and variable $y_p$ takes its values in $A_p$". We define a mapping associated with it as $a : \Omega \to \{1, 0\}$ such that $a(\omega) = 1 \iff y_i(\omega) \in A_i$, $\forall i \in \{1, \ldots, p\}$. The extension of $a$ on $\Omega$ is defined as $\mathrm{ext}_\Omega(a) = \{\omega \in \Omega \,/\, a(\omega) = 1\}$.

Example. Consider the Boolean symbolic object $a$ = [size $\in$ [0, 15]] $\wedge$ [Colour $\in$ {red, white}]. An individual $\omega$ is such that $a(\omega) = 1$ if, and only if, its size is between 0 and 15 and its colour is red or white.

Operations.

Let $a = [y_1 \in A_1] \wedge \ldots \wedge [y_p \in A_p]$ and $b = [y_1 \in B_1] \wedge \ldots \wedge [y_p \in B_p]$ be two Boolean symbolic objects.

The following operations (Diday (1991), Ichino and Yaguchi (1994)) may be defined:

a) The join is defined as $a \oplus b = [y_1 \in A_1 \oplus B_1] \wedge \ldots \wedge [y_p \in A_p \oplus B_p]$, where

• if $y_i$ is quantitative or ordinal qualitative, $A_i \oplus B_i = [\min(A_{iL}, B_{iL}), \max(A_{iU}, B_{iU})]$, where $(A_{iL}, B_{iL})$ and $(A_{iU}, B_{iU})$ are the lower bounds and the upper bounds of the intervals $A_i$ and $B_i$, respectively;

• if $y_i$ is nominal qualitative, $A_i \oplus B_i$ is the union of $A_i$ and $B_i$.

b) The conjunction is defined as $a \wedge b = [y_1 \in A_1 \cap B_1] \wedge \ldots \wedge [y_p \in A_p \cap B_p]$, where $A_i \cap B_i$, $i \in \{1, \ldots, p\}$, is the intersection of the sets $A_i$ and $B_i$.

Description potential.

Let $a = \bigwedge_{i=1}^{p} a_i = \bigwedge_{i=1}^{p} [y_i \in A_i]$, $a_i = [y_i \in A_i]$, be a Boolean symbolic object.

The description potential of $a$ (De Carvalho (1994)), denoted $\pi(a)$, is the volume of the Cartesian product $A_1 \times \ldots \times A_p$ formed by the individual descriptions given by $a$, and it is calculated by the following expression:

$$\pi(a) = \prod_{i=1}^{p} \mu(A_i) \qquad (1)$$

where

$$\mu(A_i) = \begin{cases} \mathrm{cardinal}(A_i), & \text{if } y_i \text{ is qualitative or discrete quantitative} \\ \mathrm{range}(A_i), & \text{if } y_i \text{ is quantitative continuous} \end{cases} \qquad (2)$$

Example. Let $a$ = [Colour $\in$ {black, white, red}] $\wedge$ [Height $\in$ [50, 150]]. So, $\pi(a) = 3 \times (150 - 50) = 300$.
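
The join, the conjunction and the description potential can be sketched in Python as follows. The representation is an assumption made for illustration: a continuous interval is a `(lower, upper)` tuple, a nominal description is a `frozenset`, an empty interval is `None`, and ordinal qualitative variables are not handled. The function names `mu`, `join`, `meet` and `potential`, and the second object `b`, are hypothetical.

```python
from math import prod


def mu(A):
    """Measure of one description: range of an interval, cardinal of a set."""
    if A is None:                 # empty intersection of two intervals
        return 0
    if isinstance(A, tuple):
        return A[1] - A[0]
    return len(A)


def join(A, B):
    """Join of two descriptions of the same variable."""
    if isinstance(A, tuple):
        return (min(A[0], B[0]), max(A[1], B[1]))
    return A | B


def meet(A, B):
    """Conjunction (intersection) of two descriptions of the same variable."""
    if isinstance(A, tuple):
        lo, hi = max(A[0], B[0]), min(A[1], B[1])
        return (lo, hi) if lo <= hi else None
    return A & B


def potential(a):
    """Description potential: product of the measures over all variables."""
    return prod(mu(A) for A in a.values())


a = {"Colour": frozenset({"black", "white", "red"}), "Height": (50, 150)}
b = {"Colour": frozenset({"red", "blue"}), "Height": (100, 200)}  # made-up object

a_join_b = {y: join(a[y], b[y]) for y in a}
a_meet_b = {y: meet(a[y], b[y]) for y in a}

print(potential(a), potential(a_join_b), potential(a_meet_b))
```

The first printed value reproduces the description potential 300 of the example above; the other two values depend on the made-up object `b`.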
3. Proximity functions

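Let $a_1, \ldots, a_n$ be a set of $n$ Boolean symbolic objects, and let $a = \bigwedge_{i=1}^{p} a_i = \bigwedge_{i=1}^{p} [y_i \in A_i]$, $a_i = [y_i \in A_i]$ $\forall i \in \{1, \ldots, p\}$, and $b = \bigwedge_{i=1}^{p} b_i = \bigwedge_{i=1}^{p} [y_i \in B_i]$, $b_i = [y_i \in B_i]$ $\forall i \in \{1, \ldots, p\}$, where $a, b \in \{a_1, \ldots, a_n\}$, be two Boolean symbolic objects. The dissimilarity between a pair of Boolean symbolic objects is calculated by comparing them on each variable and aggregating these comparisons. In what follows, $\bar{A}_i$ denotes the complement of $A_i$ in $O_i$.

Logical comparison functions

We define the logical dissimilarity between the pair of Boolean symbolic objects $(a, b)$ for the variable $y_i$ as

$$\Phi_{1k\gamma}(a_i, b_i) = \left( \tfrac{1}{2} \left[ (\phi_{1\gamma}(a_i, b_i))^k + (\phi_{2\gamma}(a_i, b_i))^k \right] \right)^{1/k} \qquad (3)$$

where $k \in \{1, 2, \ldots\}$, $0 \le \gamma \le 0.5$ and

$$\phi_{1\gamma}(a_i, b_i) = (1 - 2\gamma)\,\mu(\bar{A}_i \cap B_i) + \mu(A_i \cap \bar{B}_i) + \mu(\bar{A}_i \cap \bar{B}_i \cap (A_i \oplus B_i)),$$

$$\phi_{2\gamma}(a_i, b_i) = \mu(\bar{A}_i \cap B_i) + (1 - 2\gamma)\,\mu(A_i \cap \bar{B}_i) + \mu(\bar{A}_i \cap \bar{B}_i \cap (A_i \oplus B_i)).$$

Two normalised versions of equation (3), where $k \in \{1, 2, \ldots\}$ and $0 \le \gamma \le 0.5$, are:

$$\Phi_{2k\gamma}(a_i, b_i) = \left( \tfrac{1}{2} \left[ \left( \frac{\phi_{1\gamma}(a_i, b_i)}{\mu(O_i)} \right)^k + \left( \frac{\phi_{2\gamma}(a_i, b_i)}{\mu(O_i)} \right)^k \right] \right)^{1/k} \qquad (4)$$

$$\Phi_{3k\gamma}(a_i, b_i) = \left( \tfrac{1}{2} \left[ \left( \frac{\phi_{1\gamma}(a_i, b_i)}{\mu(A_i \oplus B_i)} \right)^k + \left( \frac{\phi_{2\gamma}(a_i, b_i)}{\mu(A_i \oplus B_i)} \right)^k \right] \right)^{1/k} \qquad (5)$$

Proposition 1. The proximity function $\Phi_{jk\gamma}$, $j \in \{1, 2, 3\}$, is a metric.

Remark. Taking $k = 1$ in equations (3) and (4), we get the Ichino and Yaguchi (1994) metrics. Also, if all the variables take intervals as values and if we take $\lim_{k \to \infty} \Phi_{1k\gamma}$, we get the Moore (1979) metric.

Statistical comparison functions.

Let $a = \bigwedge_{i=1}^{p} a_i = \bigwedge_{i=1}^{p} [y_i \in A_i]$, $a \in \{a_1, \ldots, a_n\}$. Let $m \in O_i$ be a modality of a discrete variable $y_i$. The relative weight of $m$ is defined as (De Carvalho (1995)):

$$RW(m) = \frac{1}{n} \sum_{a \in \{a_1, \ldots, a_n\}} \frac{\delta_a(m)}{\mathrm{cardinal}(A_i)}, \qquad \delta_a(m) = \begin{cases} 0, & \text{if } m \notin A_i \\ 1, & \text{if } m \in A_i \end{cases} \qquad (6)$$

Let $y_i$ be a continuous variable and let $I_1, \ldots, I_{n_i}$ be $n_i$ intervals which form a partition of $O_i$. The relative weight of $I_v$, $v \in \{1, \ldots, n_i\}$, is defined as (De Carvalho (1995)):

$$RW(I_v) = \frac{1}{n} \sum_{a \in \{a_1, \ldots, a_n\}} \frac{\mathrm{Range}(I_v \cap A_i)}{\mathrm{Range}(A_i)} \qquad (7)$$

Let $a_i = [y_i \in A_i]$ and let

$$H(A_i) = \begin{cases} \displaystyle\sum_{m \in A_i} \frac{1}{\eta(m)}, & \text{if } y_i \text{ is discrete, where } \eta(m) = \begin{cases} 1, & \text{if } RW(m) = 0 \\ RW(m), & \text{if } RW(m) \ne 0 \end{cases} \\[3ex] \displaystyle\sum_{I_v \subseteq A_i} \frac{\mathrm{Range}(I_v)}{\eta(I_v)}, & \text{if } y_i \text{ is continuous, where } \eta(I_v) = \begin{cases} 1, & \text{if } RW(I_v) = 0 \\ RW(I_v), & \text{if } RW(I_v) \ne 0 \end{cases} \end{cases} \qquad (8)$$

The function $H$ defined above gives a weight (which is the inverse of the relative weight according to equations (6) and (7)) to each value or interval which is in $A_i$. Consequently, a value which is rarely observed (small relative weight) will have a large weight.

An extreme case arises when a value or an interval has null relative weight. This means that this value or interval is not in the description of any Boolean symbolic object. In that case, the function $H$ does not furnish a weight to this value or interval.

We define the statistical dissimilarity between the pair of Boolean symbolic objects $(a, b)$ for the variable $y_i$ as

$$\Phi_{4k\gamma}(a_i, b_i) = \left( \tfrac{1}{2} \left[ (\phi_{3\gamma}(a_i, b_i))^k + (\phi_{4\gamma}(a_i, b_i))^k \right] \right)^{1/k} \qquad (9)$$

where $k \in \{1, 2, \ldots\}$, $0 \le \gamma \le 0.5$ and

$$\phi_{3\gamma}(a_i, b_i) = (1 - 2\gamma)\,H(\bar{A}_i \cap B_i) + H(A_i \cap \bar{B}_i) + H(\bar{A}_i \cap \bar{B}_i \cap (A_i \oplus B_i)),$$

$$\phi_{4\gamma}(a_i, b_i) = H(\bar{A}_i \cap B_i) + (1 - 2\gamma)\,H(A_i \cap \bar{B}_i) + H(\bar{A}_i \cap \bar{B}_i \cap (A_i \oplus B_i)).$$

Two normalised versions of equation (9) are:

$$\Phi_{5k\gamma}(a_i, b_i) = \left( \tfrac{1}{2} \left[ \left( \frac{\phi_{3\gamma}(a_i, b_i)}{H(O_i)} \right)^k + \left( \frac{\phi_{4\gamma}(a_i, b_i)}{H(O_i)} \right)^k \right] \right)^{1/k} \qquad (10)$$

$$\Phi_{6k\gamma}(a_i, b_i) = \left( \tfrac{1}{2} \left[ \left( \frac{\phi_{3\gamma}(a_i, b_i)}{H(A_i \oplus B_i)} \right)^k + \left( \frac{\phi_{4\gamma}(a_i, b_i)}{H(A_i \oplus B_i)} \right)^k \right] \right)^{1/k} \qquad (11)$$

where $k \in \{1, 2, \ldots\}$, $0 \le \gamma \le 0.5$.

Aggregation functions

We define the global dissimilarity between the pair of Boolean symbolic objects $(a, b)$ as

$$d_r(a, b) = \left( \frac{1}{p} \sum_{i=1}^{p} \left[ \Phi_{jk\gamma}(a_i, b_i) \right]^r \right)^{1/r} \qquad (12)$$

where $j \in \{1, 2, 3, 4, 5, 6\}$, $k \in \{1, 2, \ldots\}$ and $r \in \{1, 2, \ldots\}$.

Remark. This global dissimilarity will be logical or statistical according to the comparison function chosen in (12).

Proposition 2. The proximity function $d_r$ is a metric if we choose $\Phi_{jk\gamma}$, $j \in \{1, 2, 3\}$, in (12).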



4. Discussion 

From the functions $\phi_{1\gamma}$ and $\phi_{2\gamma}$, we can see that the logical dissimilarity functions defined in equations (3), (4) and (5) are based on a measure of the values which express the differences between the pair of Boolean symbolic objects for the variable $y_i$ ($\mu(\bar{A}_i \cap B_i)$ and $\mu(A_i \cap \bar{B}_i)$, where $A_i$ and $B_i$ describe the two objects and the bar denotes the complement) and on a measure of the values which appear when $y_i$ is qualitative ordinal or quantitative and when $A_i$ and $B_i$ are disjoint ($\mu(\bar{A}_i \cap \bar{B}_i \cap (A_i \oplus B_i))$).

Like the logical dissimilarity functions, the statistical dissimilarity functions defined in equations (9), (10) and (11) are based on a measure of the values which express the differences between the pair of Boolean symbolic objects for the variable $y_i$ ($H(\bar{A}_i \cap B_i)$ and $H(A_i \cap \bar{B}_i)$) and on a measure of the values which appear when $y_i$ is qualitative ordinal or quantitative and when $A_i$ and $B_i$ are disjoint ($H(\bar{A}_i \cap \bar{B}_i \cap (A_i \oplus B_i))$), but now weighted by the inverse of the relative weight.

In the case of quantitative variables, to obtain a set of intervals forming a partition of the observation set, we propose to sort the set of lower and upper bounds of the intervals of all symbolic objects. Each interval of the partition is formed by two consecutive bounds. The idea is to get intervals which are strictly included in the set of values describing each symbolic object.
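
The construction of the partition from the interval bounds can be sketched in a few lines of Python. The function name `elementary_intervals` and the input data are hypothetical; intervals are given as `(lower, upper)` tuples.

```python
def elementary_intervals(intervals):
    """Sort all distinct lower and upper bounds and pair consecutive
    bounds to form the intervals of the partition."""
    bounds = sorted({b for lo, hi in intervals for b in (lo, hi)})
    return list(zip(bounds, bounds[1:]))


# Made-up interval descriptions of one quantitative variable.
overlapping = [(0, 10), (5, 20)]
disjoint = [(0, 1), (3, 4)]

print(elementary_intervals(overlapping))
print(elementary_intervals(disjoint))
```

With overlapping descriptions every elementary interval lies inside at least one description. With disjoint descriptions the sketch also returns the gap between them, an interval that belongs to no description and therefore has null relative weight.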

In order to corroborate the efficacy of our approaches, several simulation studies with both logical and statistical dissimilarities were carried out. The effectiveness of our proximity functions was illustrated by the results of a hierarchical conceptual algorithm based on simple distance calculation (Ichino and Yaguchi (1994)) applied to different data sets (chemical, biological and others). All the simulations seem to corroborate the approaches.

In these approaches, a large logical dissimilarity is obtained, as expected, when the number of differences and the separation between the objects are large, whereas a large statistical dissimilarity is also obtained when these differences and this separation are rarely observed in the symbolic objects of the knowledge base.



References 

De Carvalho, F. A. T. (1994). Proximity coefficients between Boolean symbolic
objects. In: New Approaches in Classification and Data Analysis, Diday, E.
et al. (eds.), 387-394, Springer-Verlag, Berlin-Heidelberg.

De Carvalho, F. A. T. (1995). Histograms in symbolic data analysis. Annals of
Operations Research, 55, 299-322.

Diday, E. (1991). Des objets de l'analyse de données à ceux de l'analyse de
connaissances. In: Induction symbolique et numérique à partir de données,
Diday, E. et Kodratoff, Y. (eds.), 9-75, Cépaduès Editions, Toulouse.

Marcotorchino, F. (1991). La classification automatique aujourd'hui: bref aperçu
historique, applicatif et calculatoire. Publication Scientifiques et Techniques
d'IBM France 2, 35-94.

Gowda, K. C. and Diday, E. (1991). Symbolic clustering using a new
dissimilarity measure. Pattern Recognition 24 (6), 567-578.

Gowda, K. C. and Ravi, T. V. (1995). Agglomerative clustering of symbolic
objects using the concepts of both similarity and dissimilarity. Pattern
Recognition Letters 16, 647-652.

Ichino, M. and Yaguchi, H. (1994). Generalized Minkowski metrics for mixed
feature-type data analysis. IEEE Transactions on Systems, Man and
Cybernetics 24, 698-708.

Moore, R. E. (1979). Interval Analysis. Prentice Hall Inc., Englewood Cliffs,
N. J.




Vertices Principal Components Analysis 
With an Improved Factorial Representation 



A. Chouakria , E. Diday and P. Cazes 
Universite Paris IX Dauphine, LISE-CEREMADE 
Place du Marechal de Lattre de Tassigny F-75016, Paris France. 
Ahlame.Chouakria@inria.fr, Chouakri@ceremade.dauphine.fr 



Abstract: We propose, in the framework of Symbolic Data Analysis
(Diday 89), a new method which takes as input an array of objects described by
interval data (Cazes & al. 97). In the factorial space, each object is described
by principal components of interval type. It is therefore represented in the
factorial plane by a rectangle, a segment or a point. The interpretation
parameters are generalised accordingly. An additional iterative subroutine is
proposed to give a rectangular representation which takes into account the
relative contributions of the objects. An application to face recognition is
provided to illustrate the effectiveness of the whole method.

Keywords: Principal component analysis, interval data, Symbolic Data
Analysis.



1. Introduction 

Where do interval data come from? An interval can be the value taken by
an attribute describing a database object, or the value taken by a slot
describing an expert system frame, or merely the value taken by a feature
describing a class obtained by an automatic classification algorithm. An interval
can also be the result of an imprecise measurement, a confidence interval, or the
summary of a set of observations, etc. Semantically, an interval expresses
imprecision or variability knowledge.

Intervals expressing imprecise knowledge often have a small span and are
denoted $[x - \delta, x + \delta]$, where x is the observed value and $\delta$ is
the imprecision of the measuring instrument. Intervals expressing variability
knowledge generally summarise a set of values. For instance, the description
of the number of petals of the object plant, which has many flowers, is the
interval which encloses the numbers of petals of all the flowers of the plant.

We wish to extend principal component analysis to interval data. We suppose
that all the information at our disposal is the bounds of the intervals, with no
other consideration towards supposed probability distributions. If such
additional data are available, other data analysis methods would be more
appropriate. However, in any case the synthetic power of the interval
representation should be worth considering. Our Vertices Method applies the
dimensionality reduction of Principal Component Analysis while coping with the
specific semantics expressed by interval data.



2. Interval Data 

Let $H_1, \ldots, H_m$ be m objects described by q features $X_1, \ldots, X_q$ of
interval type:

$$
X = \begin{pmatrix}
[\underline{x}_{11}, \overline{x}_{11}] & \cdots & [\underline{x}_{1q}, \overline{x}_{1q}] \\
\vdots & & \vdots \\
[\underline{x}_{m1}, \overline{x}_{m1}] & \cdots & [\underline{x}_{mq}, \overline{x}_{mq}]
\end{pmatrix} \qquad (1)
$$

where $[\underline{x}_{ij}, \overline{x}_{ij}]$ is the interval taken by the feature
$X_j$ for the object $H_i$, and $\underline{x}_{ij}$, $\overline{x}_{ij}$ are,
respectively, the lower and the upper bounds of feature $X_j$ for object $H_i$.
An interval $[\underline{x}_{ij}, \overline{x}_{ij}]$ is called trivial if it is
reduced to a point: $\underline{x}_{ij} = \overline{x}_{ij}$.

Let $q_i$ be the number of nontrivial intervals describing the object $H_i$. Such
an object can be represented in the descriptive space, defined by the q features
$X_j$, by a hyper-parallelogram-rectangle (called simply hyper-cube in the
sequel) with $n_i = 2^{q_i}$ vertices. The lengths of the sides of the hyper-cube
are given by the spans of the intervals of the corresponding descriptive features.
In the following we use the term hyper-cube loosely, to refer to a point
($q_i = 0$), a segment ($q_i = 1$), a rectangle ($q_i = 2$), etc. We propose to
describe each object $H_i$ defined in (1) by a matrix $X_{H_i} = (x^i_{kj})$ of
$n_i$ rows and q columns, where $x^i_{kj}$ is the value taken by the feature
$X_j$ for the k-th vertex $S^i_k$ of the hyper-cube $H_i$.

The matrix $X_{H_i}$ gives the description of $\{S^i_1, \ldots, S^i_{n_i}\}$, the
set of the vertices of the hyper-cube $H_i$, by the q features $X_j$.
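
As an illustration, the vertex matrix of one object can be enumerated with a Cartesian product over the bounds of its intervals. This is a minimal sketch; the function name `vertices` and the representation of an object as a list of `(lower, upper)` pairs are assumptions made for the example.

```python
from itertools import product


def vertices(intervals):
    """Rows of the vertex matrix of the hyper-cube described by `intervals`.

    A trivial interval (lower == upper) contributes a single value, so an
    object with q_i nontrivial intervals yields 2**q_i vertices.
    """
    choices = [(lo,) if lo == hi else (lo, hi) for lo, hi in intervals]
    return [list(v) for v in product(*choices)]


# One object described by q = 3 interval features, the second one trivial.
h = [(1.0, 3.0), (2.0, 2.0), (0.0, 5.0)]
X_h = vertices(h)
for row in X_h:
    print(row)
```

Here the object has two nontrivial intervals, so the hyper-cube degenerates to a rectangle with four vertices.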

3. Objects and Vertices Weighting

Let $p_i$ be the weight of the object $H_i$. The main idea of the Vertices Method
is to distribute all the mass of the object over its vertices. Let $x_{ij}$ be the
reference value that best represents the interval
$[\underline{x}_{ij}, \overline{x}_{ij}]$. If the distribution inside the interval
is known, $x_{ij}$ is the mean, the mode, etc. of the distribution. If no
information is available, $x_{ij}$ is taken as the centre of the interval. We
determine the probabilistic coefficients $\underline{p}_{ij}$, $\overline{p}_{ij}$
of, respectively, the bounds $\underline{x}_{ij}$ and $\overline{x}_{ij}$ so that
$x_{ij}$ becomes the gravity centre of the two bounds:

$$
\underline{p}_{ij} + \overline{p}_{ij} = 1, \qquad
\underline{p}_{ij}\,\underline{x}_{ij} + \overline{p}_{ij}\,\overline{x}_{ij} = x_{ij}.
$$

We define the weight of the vertex $S^i_k$ as the product of the probabilistic
coefficients of the associated bounds:

$$
p(S^i_k) = p_i \prod_{j} p(x^i_{kj}),
$$

with $p(x^i_{kj}) = \underline{p}_{ij}$ if $x^i_{kj} = \underline{x}_{ij}$ and
$p(x^i_{kj}) = \overline{p}_{ij}$ if $x^i_{kj} = \overline{x}_{ij}$. Operators
other than the product can also be considered.
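
The weighting scheme can be sketched in a few lines. The helper names (`bound_coefficients`, `weighted_vertices`) are hypothetical, and the sketch assumes that a trivial interval carries a coefficient of 1 and that the reference value defaults to the interval centre when none is supplied.

```python
from itertools import product


def bound_coefficients(lo, hi, ref=None):
    """Coefficients (p_lo, p_hi) with p_lo + p_hi = 1 and p_lo*lo + p_hi*hi = ref."""
    if ref is None:
        ref = (lo + hi) / 2.0  # no information: use the centre of the interval
    p_hi = (ref - lo) / (hi - lo)
    return 1.0 - p_hi, p_hi


def weighted_vertices(intervals, p_obj=1.0, refs=None):
    """List of (vertex, weight); the weights of one object sum to p_obj."""
    refs = refs or [None] * len(intervals)
    per_feature = []
    for (lo, hi), ref in zip(intervals, refs):
        if lo == hi:  # trivial interval: a single bound with coefficient 1
            per_feature.append([(lo, 1.0)])
        else:
            p_lo, p_hi = bound_coefficients(lo, hi, ref)
            per_feature.append([(lo, p_lo), (hi, p_hi)])
    result = []
    for combo in product(*per_feature):
        weight = p_obj
        for _, p in combo:
            weight *= p  # product of the coefficients of the associated bounds
        result.append(([value for value, _ in combo], weight))
    return result


# Object of weight 0.5; the reference value of the first feature is 1.0.
cloud = weighted_vertices([(0.0, 4.0), (1.0, 1.0), (2.0, 6.0)],
                          p_obj=0.5, refs=[1.0, None, None])
total = sum(w for _, w in cloud)
mean_first = sum(w * v[0] for v, w in cloud) / total
print(total, mean_first)
```

With these coefficients the weighted mean of the vertices on each feature falls back on the reference value, which is the gravity-centre condition stated above.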

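4. Principal components scores

Let $X = (x^i_{kj})$ be the data matrix of $n = \sum_{i=1}^{m} n_i$ rows and q
columns, to be diagonalised. We note $\Psi = (pc^i_{kj})$ the matrix of
$n = \sum_{i=1}^{m} n_i$ rows and p columns giving the coordinates of the vertices
of the hyper-cubes on the p first principal components. The coordinate of the
hyper-cube $H_i$ on the j-th principal component of interval type is
$[\underline{pc}_{ij}, \overline{pc}_{ij}]$, with

$$
\underline{pc}_{ij} = \min_{1 \le k \le n_i} pc^i_{kj}, \qquad
\overline{pc}_{ij} = \max_{1 \le k \le n_i} pc^i_{kj}.
$$

Let $V_S$ be the variance-covariance matrix of general term

$$
v_{jl} = \sum_{i=1}^{m} \sum_{k=1}^{n_i} p(S^i_k)\,(x^i_{kj} - \bar{x}_j)(x^i_{kl} - \bar{x}_l),
\qquad \bar{x}_j = \sum_{i=1}^{m} p_i\, x_{ij}.
$$

It can be written as the sum of two matrices $V_C$ and $E$. $V_C$, of general term

$$
c_{jl} = \sum_{i=1}^{m} p_i\,(x_{ij} - \bar{x}_j)(x_{il} - \bar{x}_l),
$$

is the variance-covariance matrix of the numerical features obtained by
substituting a single central value for each interval. $E$, of general term

$$
e_{jj} = \sum_{i=1}^{m} p_i\, \underline{p}_{ij}\, \overline{p}_{ij}\,
(\overline{x}_{ij} - \underline{x}_{ij})^2,
$$

is a diagonal matrix which expresses the variability due to the interval data:
$v_{jj} = c_{jj} + e_{jj}$.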
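
The decomposition of the vertices variance-covariance matrix into a "centres" part and a diagonal "interval" part can be checked numerically. The sketch below is an illustration on made-up data, not the authors' code; it assumes object weights summing to 1 and reference values taken as interval centres, so that every bound coefficient equals 1/2 and each vertex of object i receives weight p_i / n_i.

```python
from itertools import product


def vertex_cloud(objects, weights):
    """Vertices of all hyper-cubes with their weights (centres as reference values)."""
    cloud = []
    for intervals, p in zip(objects, weights):
        choices = [(lo,) if lo == hi else (lo, hi) for lo, hi in intervals]
        verts = list(product(*choices))
        cloud.extend((v, p / len(verts)) for v in verts)
    return cloud


def covariance(points, w):
    """Weighted variance-covariance matrix of `points` (weights sum to 1)."""
    q = len(points[0])
    mean = [sum(wk * x[j] for x, wk in zip(points, w)) for j in range(q)]
    return [[sum(wk * (x[j] - mean[j]) * (x[l] - mean[l]) for x, wk in zip(points, w))
             for l in range(q)] for j in range(q)]


# Three objects, two interval features (illustrative values).
objects = [[(1.0, 3.0), (2.0, 2.0)],
           [(0.0, 4.0), (1.0, 5.0)],
           [(2.0, 6.0), (3.0, 4.0)]]
p = [0.2, 0.3, 0.5]

cloud = vertex_cloud(objects, p)
V_S = covariance([v for v, _ in cloud], [w for _, w in cloud])
centres = [[(lo + hi) / 2.0 for lo, hi in obj] for obj in objects]
V_C = covariance(centres, p)
# Diagonal of E: p_i * (1/2) * (1/2) * (upper - lower)^2 summed over the objects.
E = [sum(pi * 0.25 * (obj[j][1] - obj[j][0]) ** 2 for obj, pi in zip(objects, p))
     for j in range(2)]

for j in range(2):
    print(V_S[j], V_C[j], E[j])
```

The off-diagonal terms of the two matrices coincide because, within one object, the product weighting makes the vertex coordinates uncorrelated across features.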

5. Generalised interpretation parameters

The parameters of interpretation are easily extended to the objects $H_i$. To
measure the relative contribution of the object $H_i$ according to the j-th
factorial axis, let us introduce

$$
Cor(H_i, u_j) = \frac{1}{p_i} \sum_{k=1}^{n_i} p(S^i_k)\,
\frac{(pc^i_{kj})^2}{d^2(S^i_k, G)},
$$

which indicates the mean of the square cosines of the angles made by the
vertices and the j-th factorial axis. We denote the relative contribution of
the vertex $S^i_k$ according to the j-th axis by

$$
Cor(S^i_k, u_j) = \frac{(pc^i_{kj})^2}{d^2(S^i_k, G)}.
$$

Similarly, we measure the contribution of $H_i$ to the inertia of the j-th
factorial axis,

$$
Ctr(H_i, u_j) = \frac{1}{\lambda_j} \sum_{k=1}^{n_i} p(S^i_k)\,(pc^i_{kj})^2,
$$

and to the total inertia $I_T$ of the n vertices of the m objects,

$$
Inr(H_i) = \frac{1}{I_T} \sum_{k=1}^{n_i} p(S^i_k)\, d^2(S^i_k, G).
$$

6. More precise factorial representation 

The rectangle $([\underline{pc}_{ij}, \overline{pc}_{ij}], [\underline{pc}_{il}, \overline{pc}_{il}])$
representing the object $H_i$ in the factorial plane $(PC_j, PC_l)$ is an
over-cover of the hyper-cube projection. It is built from all the vertices
$S^i_k$ without regard to the relative contribution of the vertices to the
factorial plane $(PC_j, PC_l)$. We propose an iterative subroutine which gives,
for each pair of principal components $(PC_j, PC_l)$ and for each level
$\alpha$ ($0 < \alpha < 1$), the representation of the object $H_i$ by a
rectangle (point or segment) built from the vertices $S^i_k$ whose relative
contributions $Cor(S^i_k) = Cor(S^i_k, u_j) + Cor(S^i_k, u_l)$ according to the
j-th and the l-th factorial axes are greater than or equal to $\alpha$. In the
present work the level $\alpha$ is fixed a priori by an expert. Our current
studies aim to define the level $\alpha$ in terms of the quality of
representation of the objects. The proposed subroutine makes it possible to
estimate the accuracy of the overlap, and of the closeness, of the objects in
the factorial plane. Let $[\underline{pc}^{\alpha}_{ij}, \overline{pc}^{\alpha}_{ij}]$
be the coordinate of the object $H_i$ on the j-th axis of the factorial plane
$(PC_j, PC_l)$ at the level $\alpha$:

$$
\underline{pc}^{\alpha}_{ij} = \min_{k} \{\, pc^i_{kj} \;/\; Cor(S^i_k) \ge \alpha \,\}, \qquad
\overline{pc}^{\alpha}_{ij} = \max_{k} \{\, pc^i_{kj} \;/\; Cor(S^i_k) \ge \alpha \,\}.
$$
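
The level-alpha rectangle can be sketched as a simple filter on the vertices of one object. The function name `alpha_rectangle` and the input layout (one list of principal component scores per vertex, plus the squared distance of each vertex to the gravity centre G) are assumptions made for the illustration; the example takes G at the origin and keeps all components, so that the squared distance is the sum of the squared scores.

```python
def alpha_rectangle(scores, sq_dist, j, l, alpha):
    """Rectangle of one object in the plane (PC_j, PC_l) at level alpha.

    Only the vertices whose relative contribution to the plane,
    (pc_j**2 + pc_l**2) / d2, is at least alpha are kept.
    Returns ((min_j, max_j), (min_l, max_l)), or None if no vertex qualifies.
    """
    kept = [s for s, d2 in zip(scores, sq_dist)
            if d2 > 0 and (s[j] ** 2 + s[l] ** 2) / d2 >= alpha]
    if not kept:
        return None
    return ((min(s[j] for s in kept), max(s[j] for s in kept)),
            (min(s[l] for s in kept), max(s[l] for s in kept)))


# Scores of the four vertices of one object on three principal components.
scores = [[2.0, 1.0, 0.0], [-1.0, 0.5, 3.0], [3.0, -1.0, 1.0], [0.2, 0.1, 2.0]]
sq_dist = [sum(x * x for x in s) for s in scores]

print(alpha_rectangle(scores, sq_dist, 0, 1, 0.0))  # over-cover: all vertices
print(alpha_rectangle(scores, sq_dist, 0, 1, 0.5))  # well represented vertices only
```

Raising the level shrinks the rectangle, since poorly represented vertices no longer stretch it.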
7. Face recognition application

Data description: The face recognition data (Leroy & al. 96) describe 27
images of 9 human faces: {Fra (o), Hus ( A ), Inc (+), Isa ( x ), Jpl ( 0 ), Kha ( V ),
Lot (0), Phi (*), Rom (<+>)}. Three images per person are taken. The 27
images are described by 6 features $X_j$ of interval type. Each feature $X_j$
measures a distance (in number of pixels) between two particular points of
the face (figure 1). The input data can be summarised in a table of 27 rows and
6 columns. The interval features express the measurement imprecision due to the
rotation, the deviation, the removal, etc. of the face while taking the image.




Results: The Vertices Method applied to the data described above brings out
three distinct classes of similar human faces: {Hus, Jpl, Phi}, {Fra, Inc} and
{Kha, Lot, Isa, Rom}. Each image is represented, in the factorial plane of the
first two principal components, by a rectangle, a segment or a point. The
representation of the 27 human face images in the factorial plane of the first
two principal components at the level α = 0.2 is given in figure 2 (a). A
more precise representation at the level α = 0.6 is given in figure 2 (b). We
can say that objects which are represented by horizontal segments are
characterised by high variation on the features most correlated with PC1,
whereas the objects represented by vertical segments are characterised by high
variation on the features most correlated with PC2.
Abstract: Instead of representing data as a point within the description space,
Symbolic Objects represent them as hyper-rectangles, to take into account some
variability within the description. They also make it possible to add some
domain knowledge, represented by rules which reduce the description space.
Unfortunately, this supplementary knowledge most of the time induces a
combinatorial growth of the computation time. In a preceding paper we presented
a method leading to a decomposition of symbolic objects into a Normal Symbolic
Form (NSF) which allows an easier calculation. In this paper we detail the
complexity of the computation, comparing the usual method and the one induced by
the NSF.

Key words: Symbolic Objects, rules, complexity 

1. Introduction 

Constrained Boolean Symbolic Objects, defined by Diday (1991), are better
adapted than the usual objects of data analysis to describe classes of
individuals such as populations or species, being able to take variability into
account. They are expressed by a logical conjunction of elements called
elementary events, and each of these elementary events represents a set of
values associated with a variable. Each Boolean symbolic object describes a
volume which is a subset of the Cartesian product of the description variables.

Some domain knowledge can be added to a set of symbolic objects by different
kinds of rules which express dependencies between the variables.

These rules reduce the description space; they interfere greatly with the
different computations between symbolic objects.

We shall focus on the computation of a useful measure function: the description
potential (De Carvalho 1997) of each object. We define the description potential
as the part of the volume described by a symbolic object which is coherent, i.e.
where all the values satisfy all the dependency rules.

Until now, the methods used to compute the description potential took rules
into account by computing the incoherent part of each object. This computation
is combinatorial in the number of rules.

In a preceding paper (Csernel & De Carvalho 1997) we presented a method
leading to a decomposition of symbolic objects into a Normal Symbolic Form
which leads to an easier computation.

We will see that most of the time we get a polynomial computation time.

2. Boolean Symbolic Object (S.O.) 

Let Ω be a set of elementary objects generally called "individuals", described by
p variables y_i where i ∈ {1,...,p}. Let O_i be the set of observations where the
variable y_i takes its values. The set O = O_1 × O_2 × ... × O_p is then called the
description space.

An elementary event, denoted by the symbolic expression e_i = [y_i ∈ V_i] where
i ∈ {1,...,p}, V_i ⊆ O_i, expresses that "the variable y_i takes its values in V_i".

A Boolean symbolic object a is a conjunction of elementary events of the form:
a = ∧_i [y_i ∈ V_i] where the different V_i ⊆ O_i represent the intension of a
set of individuals C ⊆ Ω. It can be interpreted as follows: the values of y_i for
any individual c ∈ C are in V_i.

Example: a = [colour ∈ {blue, red}] ∧ [size ∈ {small, medium}]
means that the colour of a is red or blue and its size small or medium.

A S.O. can be associated with domain knowledge which constrains the description
given by the objects. Such constraints are expressed as dependencies
between the variables. We take into account two kinds of dependencies
expressed by rules.

- The first one, called hierarchical dependency, is of the form:

if A ∈ {a_1,...,a_n} then B has No_Sense, as in:

if wings ∈ {absent} then wings_color = No_Sense (r1)

- The other, called logical dependency, is of the form:

if A ∈ {a_1,...,a_n} then B ∈ {b_1,...,b_k,...,b_m}, as in:

if wings_color ∈ {red} then Thorax_color ∈ {blue} (r2)

The term No_Sense means that the variable does not exist, hence its value is
not applicable. Both of these constraints reduce the description space, but the
first one reduces the number of dimensions, whereas the second does not. We will
call premise variable a variable which is present within a premise.
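
A Boolean symbolic object and the two kinds of rules can be represented very directly. The following is only a sketch: the dictionary-of-sets encoding, the `NS` marker and the function names are choices made for the illustration, not notation from the paper.

```python
NS = "No_Sense"  # marker for a variable that is not applicable

# a = [colour in {blue, red}] ^ [size in {small, medium}]
a = {"colour": {"blue", "red"}, "size": {"small", "medium"}}


def in_extent(individual, so):
    """True if every elementary event of the symbolic object holds."""
    return all(individual[var] in values for var, values in so.items())


def r1(d):
    """Hierarchical dependency: if wings is absent then wings_color = No_Sense."""
    return d["wings"] != "absent" or d["wings_color"] == NS


def r2(d):
    """Logical dependency: if wings_color is red then thorax_color is blue."""
    return d["wings_color"] != "red" or d["thorax_color"] == "blue"


print(in_extent({"colour": "red", "size": "small"}, a))    # satisfies a
print(in_extent({"colour": "green", "size": "small"}, a))  # does not
insect = {"wings": "present", "wings_color": "red", "thorax_color": "yellow"}
print(r1(insect), r2(insect))
```

A description is coherent when all the rules return True for it; the insect above respects r1 but violates r2.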



3. The Description Potential (D.P.) 

The description potential is the measure of the coherent part of the hypervolume
described by a symbolic object. By coherent we mean not in contradiction with
the rules. The D.P. is a good measure to compare symbolic objects, mostly used
(but not only) in distance computation. While the computation of the D.P. is
straightforward when no rules occur, it becomes combinatorial when rules are
present, because one has to remove from the D.P. the volume excluded by each
rule, then put back the volume of the intersections of the rules taken two by
two, then remove the volume of the intersections of the rules taken three by
three, and so on.

So the computation is represented by the following formulae due to De
Carvalho, and related to the Poincaré formula, where π(a) represents the D.P.
of an object, V(a) the hypervolume described by the object a and r_j the different
rules which constrain the objects:

$$
\pi(a) = V(a) - \pi\bigl(a \wedge \neg(r_1 \wedge \ldots \wedge r_t)\bigr)
$$

and

$$
\pi\bigl(a \wedge \neg(r_1 \wedge \ldots \wedge r_t)\bigr)
= \sum_{j} \pi(a \wedge \neg r_j)
- \sum_{j<k} \pi\bigl((a \wedge \neg r_j) \wedge \neg r_k\bigr) + \ldots
+ (-1)^{t-1}\, \pi\bigl((\ldots((a \wedge \neg r_1) \wedge \neg r_2)\ldots) \wedge \neg r_t\bigr).
$$

The complexity of the calculation of the description potential of a constrained
Boolean symbolic object is exponential in the number of rules and linear in the
number of variables of each connected graph of dependencies.
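
For small objects the description potential can also be obtained by brute force, which is convenient as a reference value. The sketch below enumerates every point of the description of an object and counts the coherent ones; it is not the inclusion-exclusion formula above. The variable names, the `NS` marker and the reading of rule r1 as "wings_color is No_Sense exactly when the wings are absent" are assumptions of the example, which uses the two insect objects a1 and a2 and the rules r1 and r2 of this paper.

```python
from itertools import product

NS = "No_Sense"


def coherent(d):
    """Check the two dependency rules on one point of the description space."""
    # r1 (hierarchical): wings_color is No_Sense exactly when wings are absent.
    if (d["wings"] == "absent") != (d["wings_color"] == NS):
        return False
    # r2 (logical): red wings imply a blue thorax.
    if d["wings_color"] == "red" and d["thorax_color"] != "blue":
        return False
    return True


def description_potential(obj):
    """Number of coherent points in the volume described by `obj`."""
    domains = dict(obj)
    if "absent" in obj["wings"]:
        # Absent wings make the colour not applicable.
        domains["wings_color"] = obj["wings_color"] | {NS}
    names = list(domains)
    return sum(coherent(dict(zip(names, combo)))
               for combo in product(*(domains[n] for n in names)))


a1 = {"wings": {"absent", "present"}, "wings_color": {"red", "blue"},
      "thorax_color": {"blue", "yellow"}, "thorax_size": {"big", "small"}}
a2 = {"wings": {"absent", "present"}, "wings_color": {"red", "green"},
      "thorax_color": {"blue", "red"}, "thorax_size": {"small"}}

print(description_potential(a1), description_potential(a2))
```

The enumeration visits the whole Cartesian product, so it is only usable on toy examples; it serves here as a check on faster methods.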

4. The Normal Symbolic Form (NSF) 

To avoid this huge computation we were led (Csernel & De Carvalho 1997)
to define a normal symbolic form, which consists roughly in cutting the
description space into different subspaces, each of them containing a premise
variable and all the conclusion variables linked with it by rules.

Then, each of the different subspaces is cut into different slices in which all
values of the premise variable lead to the same conclusion. Only the coherent
part of each object is represented. We say that a knowledge base is
in Normal Symbolic Form (NSF) if the following conditions are satisfied.

- First NSF condition: either no dependency occurs between the variables
belonging to the same array, or a dependency occurs between the first variable
V1 and all the others. In the latter case V1 must be a premise variable, and all
the other variables conclusion variables linked with V1 by a rule.

- Second NSF condition: all the values expressed by the premise variable of
one object lead to the same conclusion (or absence of conclusion).

The NSF is applicable only if the dependencies form a tree or a set of trees.
Most of the time a symbolic object has to be decomposed to follow the NSF,
as we can see in the following example.



Table 1: original table

      wings              wings_color    Thorax_color     Thorax_size
a1    {absent, present}  {red, blue}    {blue, yellow}   {big, small}
a2    {absent, present}  {red, green}   {blue, red}      {small}



















The previous array represents two Boolean symbolic objects called a1 and a2;
the dependency rules r1 and r2 are associated with the definition.

if wings = absent then wings_color = No_Sense. (r1)
if wings_color = red then Thorax_color = blue. (r2)

These rules induce the following dependency tree.




The description of the objects a1 and a2, representing two different (imaginary)
insect species, is not NSF.

The description has to be transformed into the sequence of the three following
tables to be NSF. The transformation process follows the dependency tree
induced by the rules. Each premise variable induces a new secondary table where
the premise variable is the first variable and the other variables are the
different conclusion variables linked with it.

In these tables the upper left corner contains the table name. A new kind of
column appears, whose values are integers referring to a line in another table
with the same name as the column. The column D.P. will be used in a
following paragraph.

The main table refers to the initial table. Tables 1 and 2 are secondary
tables. In each secondary table, a double line separates the lines where the
first variable verifies the premise from the lines where the first variable does
not verify it.



main table

        wings...   Thorax_size     D.P.
a1      {1,3}      {big, small}    10
a2      {2,4}      {small}         5

table 1

wings...   wings     colour   D.P.
1          absent    {4}      2
2          absent    {5}      2
3          present   {1,2}    3
4          present   {1,3}    3

table 2

colour   wings_color   Thorax_color      D.P.
1        {red}         {blue}            1
2        {blue}        {blue, yellow}    2
3        {green}       {blue, red}       2
4        NS            {blue, yellow}    2
5        NS            {blue, red}       2



We now have three tables instead of a single one, but only the valid parts of
the objects are represented: the tables now include the rules.

Under NSF the D.P. computation is therefore nearly immediate. It is a recursive
process: one first computes the D.P. of the objects included in the tables
corresponding to the leaves of the dependency tree, then computes the D.P. of the
tables corresponding to an upper level in the tree using the results previously
computed.

So we first compute the D.P. of the elements of table 2, then we compute the
D.P. of table 1 by summing up (for the column colour) the D.P. computed in the
lines of table 2, and so on. For example, line 3 of table 1 refers to lines 1
and 2 of table 2; its D.P. is the sum of the potentials described by these two
lines: 1 + 2 = 3. In the same way, the potential of a1 is obtained by multiplying
the sum of the D.P. due to lines 1 and 3 of table 1 (2 + 3 = 5) by the D.P. due
to the variable Thorax_size (2), giving 10.
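
The recursive computation over the three NSF tables can be written down almost literally. This sketch hard-codes the tables of the example; the dictionary layout and the function names are choices made for the illustration.

```python
NS = "No_Sense"

# table 2: line -> (values of wings_color, values of Thorax_color)
TABLE2 = {
    1: ({"red"}, {"blue"}),
    2: ({"blue"}, {"blue", "yellow"}),
    3: ({"green"}, {"blue", "red"}),
    4: ({NS}, {"blue", "yellow"}),
    5: ({NS}, {"blue", "red"}),
}

# table 1: line -> (value of wings, lines of table 2 referred to)
TABLE1 = {1: ("absent", [4]), 2: ("absent", [5]),
          3: ("present", [1, 2]), 4: ("present", [1, 3])}

# main table: object -> (lines of table 1 referred to, values of Thorax_size)
MAIN = {"a1": ([1, 3], {"big", "small"}), "a2": ([2, 4], {"small"})}


def dp_table2(line):
    """Leaf table: every listed combination is coherent by construction."""
    wings_color, thorax_color = TABLE2[line]
    return len(wings_color) * len(thorax_color)


def dp_table1(line):
    """Sum the potentials of the table 2 lines referred to by the line."""
    _, refs = TABLE1[line]
    return sum(dp_table2(r) for r in refs)


def dp_main(name):
    """Sum over the referenced table 1 lines, times the independent variable."""
    refs, thorax_size = MAIN[name]
    return sum(dp_table1(r) for r in refs) * len(thorax_size)


print([dp_table1(i) for i in sorted(TABLE1)])
print(dp_main("a1"), dp_main("a2"))
```

No rule is evaluated during the computation: the rules were absorbed once and for all when the tables were built.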

5. Complexity using NSF 

When using the NSF, the estimation of the complexity has to be divided into
two parts:

- the first one concerns the NSF transformation, and has to be computed
only once;

- the second one concerns the computation itself.

The complexity of the NSF transformation depends on the quality of the rules
provided. We cannot detail it precisely for lack of space, but it can be
appreciated through the factorisation factor F:

F = (average size of the secondary tables) / (number of initial objects)

The main complexity of the NSF transformation is due to a factorisation process
which limits the growth of the tables. This process is repeated for each of the
T secondary tables, and its complexity is roughly the same as that of a sort.
So the complexity of the NSF transformation can be estimated as T*V*(N*F)^2,
where T is the number of premise variables, N the number of objects,
V the number of variables and F the factorisation factor. For all the real
examples we have treated, we always had F < 1, so the complexity is polynomial.

But this is not always the case, and with special kinds of rules F can be greater
than one. In this case we will not estimate F but S, the size of the greatest
table which can occur. In the worst case the size of each secondary table is
double the size of its main table. As the NSF transformation is induced by the
dependency tree, S will be influenced by the shape of the tree.

- If the tree is well balanced we will have S = N*2^(log2 T) = N*T, T being the
number of premise variables, because log2(T) is the average depth of a well
balanced tree, and the factor 2 is due to the fact that, in the worst case, each
secondary table doubles the number of lines of its main table. The transformation
is still polynomial.

- If the tree is not well balanced, the depth of the tree is T and the size S of
the greatest table will be of order 2^T. This very worst case can happen only if
a conjunction of rare circumstances occurs. The complexity is still lower than
in the usual case, because it depends on the number of premise variables T,
which is smaller than the number of rules.

Considering the data in NSF, then the complexity of the computation itself is 
linear following the number of variables and the number of rules, and polyno- 
mial (N*F)^ following the number of objects: (F*N)^*V*T. 

6. Conclusion 

The decomposition of symbolic objects following the Normal Symbolic Form
provides an easy way to take dependency rules into account when dealing with
symbolic objects. In our first application trial we obtained a reduction of about
90% of our computation time, including the NSF transformation process. On
real data, we never observed a growth in the size of the data, so we always
obtained a polynomial computation time. We need to carry on our work on simulated
data to determine more precisely the performance of the NSF and to see more
precisely the influence of the form of the rules on this performance.
Abstract: Knowledge extraction from large data bases is called "Data Mining". In
Data Bases the descriptions of the units are more complex than the standard ones
due to the fact that they can contain internal variation and be structured.
Moreover, symbolic data arise from many sources in order to summarise huge
sets of data. They need more complex data tables called "symbolic data tables"
because a cell of such a data table does not necessarily contain, as usual, a
single quantitative or categorical value. For instance, a cell can contain several
values linked by a taxonomy. The need to extend standard data analysis methods
(exploratory, clustering, factorial analysis, discrimination,...) to symbolic data
tables is increasing in order to get more accurate information and summarise
extensive data sets contained in Data Bases. We call "Symbolic Data Analysis"
(SDA) the extension of standard Data Analysis to such tables. "Symbolic objects"
are defined in order to describe in an explanatory way classes of such units. They
constitute an explanatory output of a SDA and they can be used as queries of the
Data Base. A symbolic object is "complete" if its "extent" covers exactly the class
that it describes. The set of complete symbolic objects constitutes a Galois lattice.
The SDA tools developed in the European Community project "SODAS" are
finally mentioned.

Key words: Symbolic Data Analysis, Conceptual Analysis, Complex data, Data
Mining, Partitioning, Hierarchical and Pyramidal clustering, Lattices.



1. Introduction 

In a census of a country, each individual of each region is described by a set of
numerical or categorical variables given in several relations of a Data Base. In
order to study the regions, we can describe each of them by summarising the values
taken by the individuals who live in it, by means of intervals, subsets of
categorical values, histograms or probability distributions, depending on the
variable concerned. In such a way we obtain a "symbolic data table" where each row
defines the "description" of a region and each column is associated with an
original variable. An extension of standard Data Analysis to such data tables is
one of the aims of what we have called "Symbolic Data Analysis".

Another important aim is to obtain (or "mine") explanatory results by extracting
what we have called "symbolic objects". The description of a region is called
"intent", the set of individuals which satisfy this intent is called "extent". A
"symbolic object" associated with a region is defined by its intent and by a way of
finding its extent. Its syntax has an explanatory power. Consider, for instance,
the symbolic object defined by the following expression:

a(w) = [age(w) ∈ [30, 35]] ∧ [Number of children(w) < 2]. It gives at the same
time the intent of a class of individuals by the description d = ([30, 35], 2) and
makes it possible to calculate its extent. It means that an individual "w" satisfies
this intent if his age is between 30 and 35 years and he has fewer than 2 children.
If we have a set of such classes of individuals, their associated descriptions
define a higher level symbolic data table. Using this new data table we can obtain
classes (of classes) which can be associated with higher level symbolic objects.
For instance, the following symbolic object b(w) = [age(w) ⊆ [25, 40]] ∧ [Number
of children(w) < 3] contains in its extent the preceding class of individuals
considered as a higher level unit. In this paper, we show how Data Bases constitute
a natural source of symbolic data tables. Symbolic objects are then introduced in
order to describe classes of units described by rows of such data tables. We give a
formal definition of symbolic objects and some of their properties. Finally, some
tools developed in the "Sodas" European Project are presented.
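
The way a symbolic object doubles as a query can be sketched as follows. The individuals are made up for the illustration, and the encoding of the class description as an age interval plus an upper bound on the number of children is an assumption of the example.

```python
def a(w):
    """a(w) = [age(w) in [30, 35]] ^ [number of children(w) < 2]."""
    return 30 <= w["age"] <= 35 and w["children"] < 2


# Hypothetical individuals of a region.
individuals = {
    "w1": {"age": 31, "children": 0},
    "w2": {"age": 34, "children": 3},
    "w3": {"age": 52, "children": 1},
    "w4": {"age": 30, "children": 1},
}

# The extent of a is the set of individuals satisfying its intent.
extent_a = sorted(name for name, w in individuals.items() if a(w))
print(extent_a)


def b(c):
    """Higher level object, applied to a class described by (age interval, bound)."""
    (lo, hi), child_bound = c
    return 25 <= lo and hi <= 40 and child_bound <= 3


# Description d = ([30, 35], 2) of the class defined by a.
d = ((30, 35), 2)
print(b(d))
```

The second test treats the class itself as a unit: its age interval must be included in [25, 40], which is the inclusion used in the definition of b.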



2. The input of a symbolic data analysis

"Symbolic data tables" constitute the main input of a Symbolic Data Analysis.
They are defined in the following way. Columns of the input data table are
« variables » which are used in order to describe a set of units called "individuals".
Rows are called « symbolic descriptions » of these individuals because they are
not, as usual, only vectors of single quantitative or categorical values. Each cell
of this « symbolic data table » can contain data of different types:

(a) Single quantitative value : for instance, if « height » is a variable and w is an 
individual : height(w) = 3.5. (b) Single categorical value: for instance, Town(w) = 
London. 

(c) Multivalued: for instance, in the quantitative case height(w) = {3.5, 2.1, 5} 
means that the height of w can be either 3.5 or 2.1 or 5. Notice that (a) and (b) are 
special cases of (c). 

(d) Interval: for instance height(w) = [3, 5], which means that the height of w 
varies in the interval [3,5]. 

(e) Multivalued with weights: for instance a histogram or a membership function 
(notice that (a) and (b) and (c) are special cases of (e) when the weights are equal 
to 1). 

Variables can be: 
(g) Taxonomic: for instance, « the colour is considered to be "light" if it is
"yellow", "white" or "pink" ».

(h) Hierarchically dependent: for instance, we can describe the kind of computer
of a company only if it has a computer, hence the variable "does the company have
computers?" and the variable "kind of computer" are hierarchically linked.

(i) With logical dependencies, for instance: « if age(w) is less than 2 months
then height(w) is less than 10 ».



3. Source of Symbolic Data: 

Symbolic data arise from many sources in order to summarise huge sets of data.
They result from expert knowledge (scenario of traffic accidents, type of
unemployment, species of insects ...), from the probability distribution, the
percentiles or the range of any random variable associated with each cell of a
stochastic data table, from Data Analysis (factorial analysis, clustering, neural
networks, ...) of standard data tables in order to summarise and explain the
results, from time series (in describing intervals of time), from confidential
data (in order to hide the initial data by less accuracy), etc. They also result
from Relational Data Bases, in summarising the answer to a query or in order to
study a set of units whose description needs the merging of several relations, as
shown in the following example.

Example: We have two relations of a Relational Data Base defined as follows.
The first one, called "Delivery", is given in table 1. It describes five deliveries
by the name of the supplier, its company and the town from which the delivery
comes.



Delivery   Supplier   Company   Town
Liv1       F1         CNET      Paris
Liv2       F2         MATRA     Toulouse
Liv3       F3         EDF       Clamart
Liv4       F1         CNET      Lannion
Liv5       F3         EDF       Clamart



Table 1: Relation "Delivery" 



The supplyings are described by the relation "Supplying" defined in the following
table 2.

Supplying   Supplier   Town
FT1         F1         Paris
FT2         F2         Toulouse
FT3         F1         Lannion
FT4         F3         Clamart
FT5         F3         Clamart

Table 2: Relation "Supplying"



From these two relations we can deduce the following data table 3, which
describes each supplier by its company, its supplyings and its towns:

Supplier   Company   Supplying   Town
F1         CNET      FT1, FT3    1/2 Paris, 1/2 Lannion
F2         MATRA     FT2         Toulouse
F3         EDF       FT4, FT5    Clamart



Table 3: Relation "Supplier" obtained by merging the relations "Delivery" and 
"Supplying". 

Hence, we can see that in order to study a set of suppliers described by the 
variables associated to the two relations we are naturally conducted to take 
account of the four following conditions which characterise symbolic data: 

i) Multivalued: this happens with the variables "Supplying" and "Town", which can 
have several values in Table 3. 

ii) Multivalued with weights: this is the case for the towns of the supplier F1. The 
weights 1/2 mean that the town of the supplier F1 is Paris or Lannion, each with a 
frequency equal to 1/2. 

iii) Rules: some rules have to be given as input in addition to the data of Table 3. 
For instance, "if the town is Paris and the supplier is CNET, then the supplying is 
FT1". 

iv) Taxonomy: by using regions we can replace, for instance, {Paris, Clamart} by 
"Parisian Region". 
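
The merging of the two relations into one symbolic data table can be sketched as 
follows. This is only an illustration under stated assumptions: the function name 
merge_by_supplier and the dictionary layout are hypothetical, the rows are those of 
Tables 1 and 2, and the town weights are taken to be relative frequencies over the 
deliveries of each supplier. 

```python
from collections import Counter, defaultdict

# Rows of the relations "Delivery" and "Supplying" (Tables 1 and 2).
delivery = [
    ("Liv1", "F1", "CNET", "Paris"),
    ("Liv2", "F2", "MATRA", "Toulouse"),
    ("Liv3", "F3", "EDF", "Clamart"),
    ("Liv4", "F1", "CNET", "Lannion"),
    ("Liv5", "F3", "EDF", "Clamart"),
]
supplying = [
    ("FT1", "F1", "Paris"),
    ("FT2", "F2", "Toulouse"),
    ("FT3", "F1", "Lannion"),
    ("FT4", "F3", "Clamart"),
    ("FT5", "F3", "Clamart"),
]


def merge_by_supplier(delivery, supplying):
    """Build one symbolic description per supplier (hypothetical layout)."""
    companies = {}
    towns = defaultdict(Counter)
    supplies = defaultdict(set)
    for _, supplier, company, town in delivery:
        companies[supplier] = company
        towns[supplier][town] += 1          # count deliveries per town
    for item, supplier, _ in supplying:
        supplies[supplier].add(item)        # multivalued variable
    table = {}
    for supplier in sorted(companies):
        total = sum(towns[supplier].values())
        table[supplier] = {
            "company": companies[supplier],
            "supplying": sorted(supplies[supplier]),
            # multivalued variable with weights (relative frequencies)
            "town": {t: n / total for t, n in towns[supplier].items()},
        }
    return table


supplier_table = merge_by_supplier(delivery, supplying)
for supplier, description in supplier_table.items():
    print(supplier, description)
```

Each row of the result carries a single value (company), a set of values 
(supplying) and a weighted set of values (town), which are the first two 
conditions listed above. 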



4. Main output of Symbolic Data Analysis algorithms: 

Let Ω be a set of individuals, D a set of descriptions, and « y » a mapping defined 
from Ω into D which associates to each w ∈ Ω a description d ∈ D from a given 
symbolic data table. We denote by R a relation defined on D, that is, a subset of 
D×D. If (x,y) ∈ R we say that x and y are connected by R, and this is denoted by 
xRy. The characteristic mapping of R is h_R: D×D → {0,1} such that h_R(x,y) = 1 iff 
(x,y) ∈ R. We generalise the mapping h_R by the mapping H_R: D×D → L and we 
denote by [d' R d] = H_R(d,d') the result of the "comparison" of d and d' by H_R. 
We can have L = {true, false}; in this case [d' R d] = true means that there is a 
connection between d and d'. We can also have L = [0, 1] if d is more or less 
connected to d'. In this case, [d' R d] can be interpreted as the "truth value" of 
xRy or "the degree to which d is in relation R with d'" (see Bandemer and Nather 
(1992), section 5.2 on fuzzy relations). 

For instance, R ∈ {=, ≡, ≤, ⊆}, or R is an implication, a kind of matching, etc. R 
can also use a set of such operators. 

We say that a description d ∈ D is "coherent" if it is compatible with the 
conditions (h) and (i) (defined in section 2). For instance, a description which 
expresses the fact that "sex" is "male" and "number of deliveries" is 1 is not 
coherent. These descriptions constitute a set of objects to which any symbolic data 
analysis algorithm can be applied. That is why we use the word « object » to 
denote any coherent description. A coherent description of an individual is called 
an « individual object ». It can also be the description of a class of individuals; 
in this case it is an "intensional object". For instance, the coherent description 
of a scenario of accidents, of a class of failures, etc. is an intensional object. A 
« symbolic object » is defined both by an object and a way of comparing it to 
individual objects. More formally, its definition is: 

Definition of a symbolic object 

A symbolic object is a triple s = (a, R, d) where R is a relation between 
descriptions, d is a description and « a » is a mapping defined from Ω into L 
depending on R and d. 

Symbolic Data Analysis usually concerns classes of symbolic objects where R is 
fixed, "d" varies among a finite set of coherent descriptions and a(w) = H(y(w),d), 
where H is a mapping D×D → L. For instance, a(w) = [h(y(w)) R d] where h can 
be, for instance, a filter (see the example below). There are two kinds of 
symbolic objects: 

- « Boolean symbolic objects » if [y(w) R d] ∈ L = {true, false}. In this case, if 
y(w) = (y_1, ..., y_p), the y_i are of type (a) to (d), defined in section 1. 

Example: 

y = (colour, height), d = ({red, blue, yellow}, [10,15]), colour(u) = {red, yellow}, 
height(u) = {21}, R = (⊆, ⊆), a(u) = [colour(u) ⊆ {red, blue, yellow}] ∨ [height(u) 
⊆ [10,15]] = true ∨ false = true. If h(y(w)) = (colour(w), ∅), h is a filter, as 
a(w) = [h(y(w)) R d] = [colour(w) ⊆ {red, blue, yellow}] ∨ [∅ ⊆ [10,15]] = 
[colour(w) ⊆ {red, blue, yellow}], and we also get a(u) = true. 

- « Modal symbolic objects » if [y(w) R d] ∈ L = [0,1]. In this case, the y(w) are 
of type (e). An example of such a symbolic object is given below. 

Syntax of symbolic objects in the case of "assertions": 

If the initial data table contains p variables, we denote y(w) = (y_1(w), ..., 
y_p(w)), D = (D_1, ..., D_p), d ∈ D: d = (d_1, ..., d_p) and R' = (R_1, ..., R_p), 
where R_i is a relation defined on D_i. We call « assertion » a special case of a 
symbolic object defined by s = (a,R,d), where R is defined by [d' R d] = 
∧_{i=1,p} [d'_i R_i d_i] and "a" by a(w) = [y(w) R d]. Notice also that s is 
determined by the expression a(w) = ∧_{i=1,p} [y_i(w) R_i d_i]. Hence, we can say 
that this explanatory expression defines a symbolic object. 

For example, a Boolean assertion, where SPC means « socio-professional category », 
is: 

a(w) = [age(w) ⊆ {12, 20, 28}] ∧ [SPC(w) ⊆ {employee, worker}]. If the individual 
u is described in the original symbolic data table by age(u) = {12, 20} and SPC(u) = 
{employee}, then a(u) = [{12, 20} ⊆ {12, 20, 28}] ∧ [{employee} ⊆ {employee, 
worker}] = true. 

If the variables are multivalued and weighted, an example is given by: 
a(w) = [age(w) R_1 {(0.2)12, (0.8)[20,28]}] ∧ [SPC(w) R_2 {(0.4)employee, 
(0.6)worker}], where, for instance, the "matching" of two discrete probability 
distributions r and q of k values is defined by: 

\[ r \, R_i \, q = \sum_{j=1}^{k} r_j \, q_j \, e^{\,r_j - \min(r_j, q_j)} \] 

Extent of a symbolic object s: in the Boolean case, the extent of a symbolic object 
is denoted Ext(s) and defined by the extent of a, which is: Extent(a) = {w ∈ Ω / 
a(w) = true}. In the modal case, given a threshold α, it is defined by Ext_α(s) = 
Extent_α(a) = {w ∈ Ω / a(w) ≥ α}. 
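
The evaluation of an assertion and the computation of its extent can be sketched 
as follows. This is a minimal sketch, assuming that every R_i is set inclusion and 
that individuals are stored as dictionaries of sets. The individual "v" and the 
modal degree used here (the share of variables whose inclusion holds) are 
hypothetical choices made only to illustrate the threshold α. 

```python
def make_boolean_assertion(description):
    """a(w) = AND_i [y_i(w) included in d_i], with values in {True, False}."""
    def a(individual):
        return all(set(individual[var]) <= set(allowed)
                   for var, allowed in description.items())
    return a


def make_modal_assertion(description):
    """Hypothetical modal mapping with values in [0, 1]:
    the proportion of variables for which the inclusion holds."""
    def a(individual):
        hits = [set(individual[var]) <= set(allowed)
                for var, allowed in description.items()]
        return sum(hits) / len(hits)
    return a


def extent(a, population, alpha=None):
    """Boolean extent if alpha is None, otherwise extent at threshold alpha."""
    if alpha is None:
        return {w for w, ind in population.items() if a(ind)}
    return {w for w, ind in population.items() if a(ind) >= alpha}


d = {"age": {12, 20, 28}, "SPC": {"employee", "worker"}}
population = {
    "u": {"age": {12, 20}, "SPC": {"employee"}},   # individual of the text
    "v": {"age": {35}, "SPC": {"worker"}},         # hypothetical individual
}

a_bool = make_boolean_assertion(d)
a_modal = make_modal_assertion(d)
print(extent(a_bool, population))
print(extent(a_modal, population, alpha=0.5))
```

Raising the threshold shrinks the modal extent: at α = 0.75 only u remains, 
since v satisfies one of the two elementary comparisons only. 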

Order between symbolic objects: if r is a given order on D, then the induced order 
on the set of symbolic objects, denoted by r_s, is defined by: s_1 r_s s_2 iff 
d_1 r d_2. 

If R is such that [d R d'] = true implies d r d', then Ext(s_1) ⊆ Ext(s_2) if 
s_1 r_s s_2. If R is such that [d R d'] = true implies d' r d, then Ext(s_2) ⊆ 
Ext(s_1) if s_1 r_s s_2. 

Tools for symbolic objects: tools between symbolic objects (Diday (1995)) can 
be needed, such as similarities (F. de Carvalho (1998)), matching, merging by 
generalisation, where a t-norm or a t-conorm (Schweizer, Sklar (1983) and Diday, 
Emilion (1995)) denoted T can be used, and splitting by specialisation (Ciampi et 
al. (1995)). Under some assumptions on the choice of R and T, it can be shown that 
the underlying structure of a set of symbolic objects is a Galois lattice (Brito 
(1994), Polaillon, Diday (1997)), where the vertices are closed sets defined by 
« complete symbolic objects ». More precisely, the associated Galois 
correspondence is defined by two mappings F and G: 

- F: from P(Ω) (the power set of Ω) into S (the set of symbolic objects) such that 
F(C) = s, where s = (a, R, d) is defined by d = T_{c∈C} y(c), and so a(w) = [y(w) R 
T_{c∈C} y(c)] for a given R. For example, if T_{c∈C} y(c) = ∪_{c∈C} y(c), R ≡ « ⊆ », 
y(u) = {pink, blue}, C = {c, c'}, y(c) = {pink, red} and y(c') = {blue, red}, then 
a(u) = [y(u) R T_{c∈C} y(c)] = [{pink, blue} ⊆ {pink, red} ∪ {blue, red} = {pink, 
red, blue}] = true and u ∈ Ext(s). 

- G: from S into P(Ω) such that G(s) = Ext(s). 

A « complete symbolic object » s is such that F(G(s)) = s. Such objects can be 
selected from the Galois lattice but also, from a partitioning, a hierarchical or a 
pyramidal clustering, from the most influential individuals to a factorial axis, from 
a decision tree, etc. 
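
The two mappings and the completeness test F(G(s)) = s can be sketched as follows, 
assuming, as in the example above, that T is the set union and R is set inclusion. 
A symbolic object is represented here by its description d only, which is a 
simplification; the names F, G and is_complete mirror the text, and the 
description containing "green" is a hypothetical counter-example. 

```python
def F(C, y):
    """Generalisation of a class C of individuals: d = union of the y(c)."""
    return frozenset().union(*(y[c] for c in C))


def G(d, y):
    """Extent of the symbolic object with description d, for R = inclusion."""
    return frozenset(w for w, description in y.items() if description <= d)


def is_complete(d, y):
    """A symbolic object is complete when F(G(s)) = s."""
    return F(G(d, y), y) == d


y = {
    "c": frozenset({"pink", "red"}),
    "c'": frozenset({"blue", "red"}),
    "u": frozenset({"pink", "blue"}),
}

d = F({"c", "c'"}, y)
print(sorted(d), sorted(G(d, y)), is_complete(d, y))

# Hypothetical description that is not closed: "green" is never observed.
print(is_complete(frozenset({"pink", "red", "green"}), y))
```

In this toy table the generalisation of {c, c'} also covers u, and the resulting 
description is closed, whereas the description containing "green" is not. 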

We can consider four kinds of advantages given by symbolic objects. First, they 
give a summary of the original symbolic data table in an explanatory way (i.e. 
close to the initial language of the user) by expressing descriptions based on 
properties concerning the initial variables. Second, they can easily be transformed 
into a query of a Data Base. Third, being independent of the initial data table, 
they are able to identify any matching individual described in any data table. 
Fourth, they are able to give a new symbolic data table of higher level, on which a 
symbolic data analysis of second level can be applied. 



5. What can symbolic data analysis do that standard statistical 
methods cannot? 

- Start with a symbolic data table as input. 

- Use generalisation processes during the algorithms. 

- Give explanations of the results, in a language close to that of the user, by 
symbolic objects as output. 

- Give graphical descriptions taking into account the internal variation of the 
symbolic objects. 

In the European Community Project SODAS (« Symbolic Official Data 
Analysis System ») the following methods are developed by its members: 

- Principal Component, Correspondence Analysis and Discriminant Factorial 
Analysis of a symbolic data table, 

- extension of elementary descriptive statistics (central object, histograms, 
dispersion, co-dispersion, etc.) to symbolic data, 

- mining symbolic objects from the answers to queries of a relational data base, 

- partitioning, hierarchical or pyramidal clustering of a set of individuals 
described by a symbolic data table, such that each class is associated with a 
complete symbolic object, 

- dissimilarities between Boolean or probabilistic symbolic objects, 

- extension of decision trees to probabilistic symbolic objects, extension of a 
Parzen discrimination method to classes of symbolic objects, 

- generalisation by a disjunction of symbolic objects of a class of individuals 
described in a standard way, 

- interactive and ergonomic graphical representation of symbolic objects. 



Conclusion 

Our general aim can be stated in the following way: mining symbolic objects in 
order to summarise huge data sets and then analyse them by Symbolic Data 
Analysis. As the underlying lattice of symbolic objects often becomes too large, 
other methods which also provide symbolic objects have to be used. The need to 
extend standard data analysis methods (exploratory, clustering, factorial analysis, 
discrimination, ...) to symbolic data tables is increasing due to the expansion of 
information technology. This need has led to a European Community project 
called SODAS (Hebrail (1996)), for « Symbolic Official Data Analysis System », 
in which 15 institutions of 9 European countries are involved. Three Official 
Statistical Institutions take part in this project: EUSTAT (Spain), INE 
(Portugal) and ONS (England). An example of application proposed on their 
Census data consists in finding clusters of unemployed people and their 
associated mined symbolic objects in a country, calculating their extent in the 
census of another country and describing this extent by new symbolic objects in 
order to compare the behaviour of the two countries. 



Abstract: Current technological progress in hardware, data bases and 
object-oriented languages implies the manipulation, storage and representation 
of objects with more and more complex data. The notion of Symbolic Objects was 
introduced on the basis of Diday's work, and most recent classification methods 
need to be adapted to it. The aim of this paper is the adaptation of the 
classical Bayesian discrimination rule to the Symbolic Objects framework. This 
is performed through the estimation of the a priori probabilities and a kernel 
density estimation. 

Keywords: Symbolic objects, training sets, a priori and a posteriori 
probabilities, window bandwidth, kernel density estimation, Bayesian 
discrimination rule, EM-like algorithm. 



1. Introduction 

Kernel density estimation is a tool which allows the statistician to construct a 
density on any sample of data without any a priori probabilistic hypothesis. 
Recent references are numerous (e.g. the books by Hand [6], Silverman [9], 
Devroye [2], ...). These methods compute a weighted sum of kernels centered 
on each data point. Examples of kernel density estimation are to be found 
essentially for quantitative (discrete or continuous) and qualitative data. For 
mixed data of both previous types, the method of "product kernels" is 
suggested (Hand [6]). 

Kernel discriminant analysis tries to give a solution to the following 
problem: "given g samples belonging to g classes, and using them as training 
sets for these classes, how can we assign one new data point (or several) to one 
of these classes?" The classical Bayesian discrimination rule, which is "assign 
the new data point x to that component $p_k f_k$ of the decomposed global density 

\[ f(x) = \sum_{k=1}^{g} p_k f_k(x) \] 

for which $p_k f_k(x)$ is maximum", needs, like most classification methods, to 
be adapted to the Symbolic Objects framework (Diday [3] [4] [5]), in which we 
can have information of different types: quantitative, qualitative, 
probabilistic, of interval type, ... within a conjunction of these. 



2. The decision rule 

Let V be the observation space (the space of symbolic objects). Identifying the 
group from which an observation $x \in V$ comes requires the definition of a 
mapping d, called the decision rule: 

\[ d : V \to \{1, \dots, g\} : x \mapsto d(x). \] 

This rule depends on the training points $x_{11}, \dots, x_{1n_1}, \dots, 
x_{g1}, \dots, x_{gn_g}$ and is defined by a minimization of the error rate. The 
rule d determines a partition $(P_1, \dots, P_g)$ of V: $P_k = \{x \mid d(x) = 
k\}$, $1 \le k \le g$. 

The rule can be defined using g functions $h_k(x)$, $k = 1, \dots, g$. The 
function $h_k(x)$ associated with the k-th population gives an idea of the 
"similarity" between x and the individuals $x_{k1}, \dots, x_{kn_k}$ of the k-th 
training sample. We have: 

\[ P_k = \{x \mid h_k(x) \ge h_i(x) \ \forall i = 1, \dots, g\}, \quad 1 \le k \le g. \] 



3. The risk vector 

During the assignment of observations to the different classes, some errors can 
occur. Let $P_{ki}$, $1 \le i \le g$, be the probability of assigning an 
individual of group k to group i. Given that some errors can be less important 
than others, a cost $C_{ki}$, $1 \le k \le g$, $1 \le i \le g$, can be 
associated with the assignment to group i of an observation from group k. Note 
that $C_{kk} = 0$ for all k, no cost being attributed to a correct assignment. 

For each population k, $1 \le k \le g$, we define the risk $R_k(d)$ as the 
expected value of the cost attributed to a wrong assignment of individuals of 
the group: 

\[ R_k(d) = \sum_{i=1}^{g} C_{ki} P_{ki}, \quad 1 \le k \le g. \] 

Note that the choice of the best rule is a critical problem. 
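
As an illustration, the risk vector can be computed from a cost matrix and a 
matrix of assignment probabilities. The sketch below assumes two groups; the 
numerical values of the costs and of the probabilities are hypothetical and 
serve only to show the computation. 

```python
def risk_vector(cost, prob):
    """R_k(d) = sum_i C_ki * P_ki, one value per group k.

    cost[k][i]: cost of assigning an individual of group k to group i.
    prob[k][i]: probability of assigning an individual of group k to group i.
    """
    return [sum(c * p for c, p in zip(cost_row, prob_row))
            for cost_row, prob_row in zip(cost, prob)]


# Hypothetical example with g = 2 groups: misclassifying an individual of
# the second group is taken to be five times as costly.
cost = [[0.0, 1.0],
        [5.0, 0.0]]
prob = [[0.9, 0.1],
        [0.2, 0.8]]

risks = risk_vector(cost, prob)
print(risks)
```

Although the second group is misassigned only twice as often as the first, 
its risk is ten times larger here because of the cost attached to that error. 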



4. The Bayesian rule 

Let us suppose that the a priori probabilities $p_k$, $1 \le k \le g$, of 
belonging to population k are known. The Bayesian risk is: 

\[ R(d) = \sum_{k=1}^{g} p_k R_k(d). \] 

Assuming that the data from population k are distributed following a density 
function $f_k(\cdot)$, $1 \le k \le g$, we have: 

\[ R_k(d) = \sum_{i=1}^{g} C_{ki} P_{ki} = \sum_{i=1}^{g} C_{ki} \int_{P_i} f_k(x)\,dx, \quad 1 \le k \le g, \] 

and then: 

\[ R(d) = \sum_{k=1}^{g} p_k \sum_{i=1}^{g} C_{ki} \int_{P_i} f_k(x)\,dx = \sum_{i=1}^{g} \int_{P_i} \Big( \sum_{k=1}^{g} p_k C_{ki} f_k(x) \Big)\,dx. \] 

R(d) is minimal for the partition $P = (P_1, \dots, P_g)$ defined by: 

\[ P_i = \Big\{ x \ \Big|\ \sum_{k=1}^{g} p_k C_{ki} f_k(x) \le \sum_{k=1}^{g} p_k C_{kj} f_k(x) \ \ \forall j = 1, \dots, g \Big\}. \] 

The costs associated with the classification errors being generally supposed 
equal, $C_{ki} = c$ for all $k, i = 1, \dots, g$, $k \ne i$, and $C_{kk} = 0$ 
for all $k = 1, \dots, g$, we have: 

\[ \sum_{k=1}^{g} p_k C_{ki} f_k(x) = c \sum_{k=1}^{g} p_k f_k(x) - c\, p_i f_i(x), \] 

and finally, leaving the constant terms out: 

\[ P_i = \{x \mid -p_i f_i(x) \le -p_j f_j(x) \ \forall j\} = \{x \mid p_i f_i(x) \ge p_j f_j(x) \ \forall j\}. \] 

The discriminant functions defining the Bayesian rule are then $h_k(x) = p_k 
f_k(x)$. The Bayesian rule can be interpreted as the assignment of an individual 
x to the population k maximizing the a posteriori probability 

\[ q_k(x) = \frac{p_k f_k(x)}{\sum_{j=1}^{g} p_j f_j(x)}. \] 






If more than one group maximizes $p_k f_k(x)$ ($\exists\, i \ne k : p_i f_i(x) = 
p_k f_k(x) \ge p_j f_j(x) \ \forall j$), some choices are possible. The 
individual x can be randomly assigned to one of these groups, can be assigned to 
the group $j = \min(i, k)$, can be left unclassified, ... 

Note that the maximum likelihood rule corresponds to the Bayesian one in the 
case of equal a priori probabilities and equal error costs. 
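
The Bayesian rule with equal costs can be sketched as follows. The sketch 
assumes that the densities are available as plain functions; the Gaussian 
helper and the numerical values are hypothetical. Ties are resolved by taking 
the smallest group index, one of the options mentioned above, and a point with 
zero density in every group is left unclassified. Groups are numbered from 0. 

```python
import math


def gaussian(mean, sd):
    """Hypothetical univariate normal density used as an example f_k."""
    def f(x):
        z = (x - mean) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return f


def posteriors(priors, densities, x):
    """q_k(x) = p_k f_k(x) / sum_j p_j f_j(x), or None if the sum is zero."""
    scores = [p * f(x) for p, f in zip(priors, densities)]
    total = sum(scores)
    if total == 0.0:
        return None
    return [s / total for s in scores]


def bayes_rule(priors, densities, x):
    """Index of the group maximizing p_k f_k(x).

    Ties go to the smallest index; None means 'left unclassified'.
    """
    scores = [p * f(x) for p, f in zip(priors, densities)]
    best = max(scores)
    if best == 0.0:
        return None
    return scores.index(best)


densities = [gaussian(0.0, 1.0), gaussian(2.0, 1.0)]
print(bayes_rule([0.5, 0.5], densities, 0.5))
print(bayes_rule([0.2, 0.8], densities, 1.0))
print(posteriors([0.5, 0.5], densities, 0.5))
```

With equal priors the rule reduces to the maximum likelihood rule; changing 
the priors moves the boundary between the two groups. 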



5. The density estimates 

The classification problem is then solved when the densities $f_k(x)$, $1 \le k 
\le g$, are known. Actually, these densities are unknown but a sample of each 
population is given. These samples will be used to estimate the g density 
functions. 

In discriminant analysis of classical data, hypotheses of multivariate normality 
can be made concerning the densities. The sample data can be supposed to come 
from g multivariate normal populations with means $\mu_k$ and equal covariance 
matrices ($\Sigma_k = \Sigma$ for all $k = 1, \dots, g$). The training sets are 
used to estimate the parameters: 

\[ \hat{\mu}_k = \bar{x}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_{ki} \quad \forall k = 1, \dots, g, \] 

\[ \hat{\Sigma} = \frac{\sum_{k=1}^{g} \sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)(x_{ki} - \bar{x}_k)'}{\sum_{k=1}^{g} n_k - g}. \] 

These estimators are plugged into the expressions of the densities and, applying 
the maximum likelihood rule, we obtain Fisher's linear discrimination rule. 

Note that, for all $k \ne i \in \{1, \dots, g\}$, we have an equivalence between 

\[ f_k(x) \ge f_i(x) \quad \text{and} \quad \Big( x - \tfrac{1}{2}(\bar{x}_k + \bar{x}_i) \Big)' \hat{\Sigma}^{-1} (\bar{x}_k - \bar{x}_i) \ge 0, \] 

which defines a half-space bounded by a hyperplane. That is why we talk about a 
linear discrimination rule. 

In the quadratic approach, the covariance matrices are not supposed equal 
and the parameters are estimated by: 

\[ \hat{\mu}_k = \bar{x}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_{ki} \quad \forall k = 1, \dots, g, \] 

\[ \hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i=1}^{n_k} (x_{ki} - \bar{x}_k)(x_{ki} - \bar{x}_k)' \quad \forall k = 1, \dots, g. \] 

After replacing the parameters by their estimates in the expressions of the 
densities, the inequalities 

\[ f_k(x) \ge f_i(x), \quad \forall k \ne i \in \{1, \dots, g\}, \] 

lead to quadratic forms. 

If no hypotheses are made about the densities, non parametric methods must be 
used to obtain density estimates. The kernel method is one of them. 



6. The uniform kernel method 



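The use of the uniform kernel density estimate produces, for d-dimensional 
quantitative data: 

\[ \hat{f}_k(x) = \frac{1}{n_k h_k^{d}} \sum_{i=1}^{n_k} K\!\left( \frac{x - x_{ki}}{h_k} \right) \] 

where 
. $x_{ki}$, $i = 1, \dots, n_k$, are the training set points of the k-th population, 
. $h_k$ is the window bandwidth, 
. K(.) is the uniform kernel: 

\[ K(y) = \begin{cases} 1 & \text{if } \|y\| \le 1 \\ 0 & \text{else.} \end{cases} \] 

Roughly speaking, $\sum_{i=1}^{n_k} K\big( (x - x_{ki}) / h_k \big)$ counts the 
number of points of the training set of population k which are at a distance 
less than $h_k$ from x (i.e. in a hypercube of side $2h_k$). 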
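
The counting interpretation can be sketched as follows. This is a minimal 
sketch under two assumptions that are not stated in the text: the norm is the 
maximum norm, so that the neighbourhood is a hypercube, and the count is 
divided by n_k h^d. The training points are hypothetical. 

```python
def uniform_kernel(u):
    """K(u) = 1 if the maximum norm of u is at most 1, else 0."""
    return 1.0 if max(abs(c) for c in u) <= 1.0 else 0.0


def uniform_kernel_density(x, training, h):
    """Count the training points in the hypercube of side 2h centred on x,
    then divide by n * h**d (assumed normalisation)."""
    d = len(x)
    count = sum(
        uniform_kernel([(xj - tj) / h for xj, tj in zip(x, t)])
        for t in training
    )
    return count / (len(training) * h ** d)


# Hypothetical training set of one population in dimension d = 2.
training_k = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0)]
estimate = uniform_kernel_density((0.5, 0.5), training_k, h=1.0)
print(estimate)
```

Two of the three training points fall in the hypercube around (0.5, 0.5), so 
the estimate is 2/3 with h = 1. 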

The choice of the bandwidth $h_k$ is crucial. According to Rasson and Granville 
[8], that choice has to be made by minimizing the estimated error rate (e.g. by 
leave-one-out estimation of it). 

As stated in the literature, the choice of the kernel does not seem to be critical. 
For example, other kernels are proposed by Silverman [9] to solve problems 
involving mixed classical data including binary and continuous components: 

\[ K(y \mid x, \lambda, h) = \lambda^{k_1 - d_1(x,y)} (1 - \lambda)^{d_1(x,y)} \, h^{-k_2} \, \varphi\!\left( \frac{d_2(x,y)}{h} \right) (2\pi)^{(1 - k_2)/2} \] 

where 
. $d_1$ is the distance between the binary components: $d_1(x,y) = (x - y)'(x - y)$, 
. $d_2$ is the Euclidean distance between the continuous components, 
. $\varphi$ is the normal density function, 
. $\lambda$ and h are smoothing parameters, 
. $k_1$ and $k_2$ are the numbers of components of binary and continuous 
types respectively. 

Among all possible kernels, the uniform one has been chosen for its 
ease of computation and interpretation. 



7. Using kernel density measures for symbolic data 

The building of such a uniform kernel can be generalized to symbolic objects. 
For this purpose, the important thing is to have a dissimilarity measure between 
symbolic objects. If no such dissimilarity is available, it is also possible to work 
with N dissimilarity measures (one measure for each variable), or at least with 
K dissimilarity measures (one for each kind of data involved in the symbolic 
objects). 

As an example, consider that the data are symbolic objects including 3 
qualitative multi-valued variables, 2 quantitative interval variables and 2 
modal variables. The density estimation can be generalized using either 1 
dissimilarity measure between objects, or 7 dissimilarity measures 
(one for each variable), or 3 dissimilarity measures (one for the 
qualitative multi-valued type, one for the quantitative interval type and one 
for the modal type). The possible choices of dissimilarity are numerous 
(Ichino [7], De Carvalho [1]). 

In the first case, the density estimation is simply performed by counting the 
points of the training set of each population lying in the hypercube. In the 
second case, a kernel is built for each variable using the appropriate 
dissimilarity measure. Then the global kernel for the symbolic object is the 
product of the N kernels. The density estimation is then performed by counting 
the points of each training set lying in all "intervals". In the last case, a 
kernel is built for each type. Then the global kernel for the symbolic object is 
the product of the kernels built on each type of data. In that last case, the 
density estimation is performed by counting the points of each training set 
lying in all "hypercubes". 

Note that the density estimators used here for symbolic data do not refer to 
classical probability densities, as no measure is available on the symbolic 
objects space. They are simply indicators of the concentration in a given 
neighbourhood. 
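
The product of uniform kernels built on several dissimilarity measures can be 
sketched as follows. The three dissimilarities used here (Jaccard for 
multi-valued variables, a Hausdorff-type distance for intervals, total variation 
for modal variables) are assumptions chosen for illustration and are not 
necessarily those of Ichino [7] or De Carvalho [1]; the objects, the variable 
names and the bandwidths are hypothetical as well. 

```python
def jaccard(a, b):
    """Dissimilarity between two sets of categories."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def hausdorff(i, j):
    """Dissimilarity between two intervals given as (lower, upper)."""
    return max(abs(i[0] - j[0]), abs(i[1] - j[1]))


def total_variation(p, q):
    """Dissimilarity between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


DISSIMILARITY = {"set": jaccard, "interval": hausdorff, "modal": total_variation}


def concentration(x, training, schema, bandwidth):
    """Share of training objects close to x on every variable.

    The product of the uniform kernels equals 1 only when each
    dissimilarity is within the bandwidth of its type.
    """
    def product_kernel(t):
        return all(DISSIMILARITY[kind](x[var], t[var]) <= bandwidth[kind]
                   for var, kind in schema.items())
    return sum(1 for t in training if product_kernel(t)) / len(training)


schema = {"colour": "set", "height": "interval", "spc": "modal"}
bandwidth = {"set": 0.5, "interval": 2.0, "modal": 0.2}
x = {"colour": {"red", "blue"}, "height": (10, 15),
     "spc": {"employee": 0.4, "worker": 0.6}}
training = [
    {"colour": {"red"}, "height": (11, 15),
     "spc": {"employee": 0.5, "worker": 0.5}},
    {"colour": {"green"}, "height": (30, 40), "spc": {"employee": 1.0}},
]
value = concentration(x, training, schema, bandwidth)
print(value)
```

As noted above, the value returned is an indicator of concentration in a 
neighbourhood of x rather than a probability density. 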



8. A priori estimates 



Estimates of the a priori probabilities $p_k$ are also needed. We choose 
$p = (p_1, \dots, p_g)$ maximizing the likelihood function: 

\[ L(p) = \prod_{y} \sum_{k=1}^{g} p_k \hat{f}_k(y) \] 

where the product is taken over all points to be classified. 

The EM-like algorithm computes the optimum iteratively. Beginning with the 
uniform proportions $p_k^{(0)} = 1/g$, the EM-like algorithm uses the recursion 

\[ p_k^{(t+1)} = \frac{1}{n} \sum_{y} \frac{p_k^{(t)} \hat{f}_k(y)}{\sum_{j=1}^{g} p_j^{(t)} \hat{f}_j(y)} \] 

where 
. n is the number of points to be classified, 
. the sum $\sum_{y}$ is taken over all these points. 

Here, the density estimates are kept constant. 
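
The recursion can be sketched as follows, assuming that the density estimates 
have been computed beforehand and stored in a table with one row per point to 
be classified. The function name, the fixed number of iterations and the 
handling of points whose estimated density is zero in every group (they are 
skipped) are assumptions of this sketch. 

```python
def estimate_priors(density_values, iterations=100):
    """EM-like estimation of the a priori probabilities.

    density_values[y][k] holds the (fixed) density estimate of group k at
    the y-th point to be classified.
    """
    g = len(density_values[0])
    p = [1.0 / g] * g                      # uniform starting proportions
    for _ in range(iterations):
        sums = [0.0] * g
        used = 0
        for row in density_values:
            mixture = sum(pj * fj for pj, fj in zip(p, row))
            if mixture == 0.0:
                continue                   # assumption: uninformative point
            used += 1
            for k in range(g):
                sums[k] += p[k] * row[k] / mixture
        if used == 0:
            return p
        p = [s / used for s in sums]
    return p


# Hypothetical table: three points supported by group 0 only and one
# point supported by group 1 only.
values = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
priors = estimate_priors(values)
print(priors)
```

In this extreme example the proportions settle after the first iteration at 
the shares of points supported by each group. 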



9. Output data 



Using the given training sets, the algorithm computes, for each input symbolic 
object x, the estimates $\hat{f}_k(x)$, $k = 1, \dots, g$. Thanks to the 
estimates $\hat{p}_k$ of the a priori probabilities, we obtain, for each x, the 
following values: 

\[ \hat{p}_k \hat{f}_k(x), \quad k = 1, \dots, g. \] 

By normalizing these values, we finally have the set of a posteriori 
probabilities of belonging to each class: 

\[ \frac{\hat{p}_k \hat{f}_k(x)}{\sum_{i=1}^{g} \hat{p}_i \hat{f}_i(x)}, \quad k = 1, \dots, g. \] 



10. Conclusion 

Currently, it is of primary importance for any field of work to be easily 
adaptable to progress in other fields. Classical distance measures do not apply 
to data of symbolic type. New notions of distance are being elaborated, which 
make possible the adaptation of most Data Analysis methods. So, with a 
well-chosen distance between objects, Symbolic Kernel Discriminant Analysis can 
be an efficient tool in the field of supervised classification of complex data. 



Summary: The paper presents a general overview of recent methodology concerned 
with the problem of growing a decision tree in the presence of uncertainty or 
imprecision. One can identify two main approaches in the literature. The first 
aims at “softening” the hard recursive partition process through the use of 
fuzzy splits. The second proposes to handle the uncertainty by using more 
complex representation languages. Both approaches, which can be related in a 
probabilistic framework, generally produce more flexible and robust 
classification rules. 

Key words: decision or classification tree, imprecision and uncertainty, soft 
split, imprecise description, symbolic data analysis. 



1. A sensitive methodology 

Binary segmentation (or decision tree, recursive partition, tree growing,...) 
aims at building a tree-structured classification rule for a given partitioned set 
of objects described by predictors. In spite of many successful applications, 
both in the Artificial Intelligence community (Quinlan 93, Kononenko et al. 
84) and the statistical community (Breiman et al 84, Ciampi 92), several 
authors have emphasized the particular sensitivity of these methods in the 
case of an imprecise data set. For example, (Quinlan 86) highlighted the 
general degrading of the performances of a decision tree from a predictive or 
a descriptive point of view caused by introducing various levels of noise in 
the data. One can see a double origin for this problem: on the one hand, 
the “hard” recursive partition process; on the other hand, the difficulty of 
handling any form of imprecision (measurement errors, subjective judgement 
of an expert,...) during the coding phase. These remarks have led to two main 
approaches that are now presented. 



2. Fuzzy or soft splits 

The objective of soft splits is to introduce more flexibility during the 
recursive partition process. This kind of split is particularly suitable for two 
reasons. First, the assignment of the objects does not change significantly when 
the observations, because of an imprecise context, are slightly modified. 
Secondly, it permits one to give more importance to the observations that better 
satisfy the properties associated with the split. The general principle of the 
soft split is represented in figure 1. 




Figure 1: Principle of a soft split. 



Here, the function F may be defined, for example, to be a logistic function: 

\[ F(y) = \frac{\exp(ay + b)}{1 + \exp(ay + b)} \qquad (1) \] 



where parameters a and b (which together determine the cutting point, where 
F = 0.5, and the softness of the split) are either estimated automatically by 
the method (for example (Quinlan 90) or (Ciampi et al. 96)) or given by an 
expert (for example (Weber 92)). These approaches have been widely used within 
fuzzy decision tree algorithms (see (Yuan et al. 95) for more details). There, 
in most cases, fuzzy splits are represented by triangular or trapezoidal 
membership functions because they are easy for the user to build and interpret. 
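
A logistic soft split can be sketched as follows. With the parametrisation of 
(1), F(y) = 0.5 at y = -b/a and the absolute value of a sets the steepness; the 
numerical values below are hypothetical, and the convention that F gives the 
membership degree of the right node is an assumption of this sketch. 

```python
import math


def soft_split(y, a, b):
    """Logistic function F(y) = exp(a*y + b) / (1 + exp(a*y + b)).

    Written in a numerically stable form for large |a*y + b|.
    """
    z = a * y + b
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def memberships(y, a, b):
    """Membership degrees (left node, right node) of an observation y."""
    right = soft_split(y, a, b)
    return 1.0 - right, right


# Hypothetical split with cutting point -b/a = 3.
a, b = 2.0, -6.0
for y in (1.0, 2.9, 3.0, 3.1, 5.0):
    print(y, memberships(y, a, b))
```

Observations close to the cutting point receive memberships near 0.5 on both 
sides, so a small perturbation of y changes the assignment only slightly. 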



3. Explicit imprecise observations 

In applications, we often face imprecise observations that are, afterwards, 
reduced to single values in order to avoid having complex codings. In symbolic 
data analysis (Diday et al. 97), it is proposed to extend the standard 
“deterministic” description of an object (that is, a vector of values) to a set 
of random variables, each of them being associated with more complex 
descriptions such as disjunctions of values or probability functions. More 
precisely, if the description space is based on P predictors $Y_1, Y_2, \dots, 
Y_P$, the studied population will be composed of objects i whose description is 
given by 

\[ (Y_{i1} \sim f_{i1}, \dots, Y_{iP} \sim f_{iP}) \qquad (2) \] 

where $f_{ij}$ represents the probability function describing the imprecision of 
the object i for the predictor j. For example, if one does not know precisely 
the value of a descriptor Y, it is possible to take the imprecision into account 
by defining Y to be a random variable distributed according to a uniform 
probability function f on [a, b] (figure 2). 

Figure 2: Assignment of a probabilistically imprecise description. 



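In this case (see (Ciampi et al. 96)), the observation described by f would be 
assigned to the region [Y > s] with probability 

\[ P[Y > s] = \int_{s}^{+\infty} f(y)\,dy, \qquad (3) \] 

which equals $(b - s)/(b - a)$ for the uniform density on [a, b] when $a \le s 
\le b$. 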




In fuzzy logic, explicitly fuzzy initial data have been much less studied than 
classical data (usually continuous) which are “fuzzified” through the use of 
soft splits. Nevertheless, and unlike in the probabilistic approach, significant 
attention has been paid to the problem of taking into account a fuzzy 
membership of the objects to the classes of the prior partition (a problem 
which concerns many real applications). 



4. A connection between the two approaches 

Both previous approaches lead to the same notion of a soft or flexible 
assignment: it consists in replacing the usual boolean degree of membership to 
the left or right node by a number varying in the interval [0, 1]. 

Moreover, in spite of their apparently distinct motivations and principles, these 
can be clearly related within the following framework. We will here limit 
ourselves to a probabilistic approach. Let us first consider the problem of the 
assignment of N observations $y_1, \dots, y_N$ through the binary split $[Y \le 
s]/[Y > s]$ (figure 3). $I_s$ represents the usual “hard” split while $F_s$ 
denotes a soft split (with $F_s(s) = 0.5$). On the basis of $y_1, \dots, y_N$, 
we build a set of imprecise observations $f_{y_1}, \dots, f_{y_N}$, each 
$f_{y_i}$ being centered on $y_i$ and with $f = F'$. 

We have then the following result: applying the soft split $F_s$ to the precise 
observations $y_1, \dots, y_N$ is strictly equivalent to using the usual split 
$I_s$ on the set of imprecise observations $f_{y_i}$. This can be formally 
written through the following equalities: 

\[ P[Y_i > s] = \int_{s}^{+\infty} f_{y_i}(y)\,dy = F_s(y_i). \] 

Figure 3: Connection between a soft split and imprecise observations. 

In other words, using for instance a soft split associated with a large “degree 
of softness” is equivalent to considering that we have a classical split applied 
to observations which are all blurred by the same very large imprecision (that 
is, the probability functions $f_{y_i}$ are associated with large standard 
errors). As a consequence, if a large variability of the slope of the split does 
not change significantly the quality of the split, we can deduce that the split 
is robust in the sense that it is not affected by very imprecise observations. 
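
The equivalence can be checked numerically in the logistic case. The sketch 
below assumes a logistic soft split with steepness a centred on s, and imprecise 
observations following a logistic density with the same scale centred on each 
observation; the integration routine and the numerical values are hypothetical 
choices made for the check. 

```python
import math


def soft_split_at(y, s, a):
    """Soft split F_s(y) with F_s(s) = 0.5 and steepness a."""
    return 1.0 / (1.0 + math.exp(-a * (y - s)))


def logistic_density(y, centre, a):
    """Density of the imprecise observation centred on 'centre'
    (derivative of the logistic function, symmetric about the centre)."""
    e = math.exp(-a * abs(y - centre))
    return a * e / (1.0 + e) ** 2


def prob_right(centre, s, a, span=60.0, steps=20000):
    """P[Y_i > s] by the trapezoidal rule on [s, max(s, centre) + span / a];
    the mass beyond the upper bound is negligible."""
    upper = max(s, centre) + span / a
    width = (upper - s) / steps
    total = 0.5 * (logistic_density(s, centre, a)
                   + logistic_density(upper, centre, a))
    total += sum(logistic_density(s + i * width, centre, a)
                 for i in range(1, steps))
    return total * width


s, a = 3.0, 2.0
for y_i in (1.0, 2.5, 3.0, 4.0):
    print(y_i, soft_split_at(y_i, s, a), prob_right(y_i, s, a))
```

For each observation the two printed values agree up to the integration 
error, which is the content of the equality between P[Y_i > s] and the value 
of the soft split at the observation. 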

Of course, if we consider a set of observations which could be affected by 
different levels of noise or imprecision, it is no longer possible to draw a 
parallel between these two approaches. 

As shown by several authors, the advantages of these two approaches may be, 
firstly, to permit the selection of more robust predictors during the growing 
of the tree; secondly, to improve the classification accuracy by preventing too 
many arbitrary assignments, such as can happen when a high level of uncertainty 
is present. 

Finally, note that, in a probabilistic context (for both previous approaches), 
the modeling of a decision rule associated with a classification tree is now 
expressed as follows: 

\[ P(c \mid y) = \sum_{t=1}^{T} P_t(y) \cdot P(c \mid t) \qquad (4) \] 

where $P_t(y)$ denotes the probability for an observation y to belong to 
the node t, and the sum runs over the T terminal nodes of the tree. 
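
The decision rule (4) can be sketched as a mixture of the class distributions 
of the nodes. The node memberships and the class distributions below are 
hypothetical values for a tree with two terminal nodes. 

```python
def class_probabilities(node_memberships, node_class_probs):
    """P(c | y) = sum_t P_t(y) * P(c | t).

    node_memberships[t]: probability that y belongs to node t.
    node_class_probs[t]: dict mapping each class c to P(c | t).
    """
    classes = node_class_probs[0].keys()
    return {
        c: sum(pt * probs[c]
               for pt, probs in zip(node_memberships, node_class_probs))
        for c in classes
    }


# Hypothetical tree with two terminal nodes and two classes.
node_memberships = [0.3, 0.7]
node_class_probs = [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.8}]
result = class_probabilities(node_memberships, node_class_probs)
print(result)
```

With a hard split the memberships would be 0 and 1 and the rule would return 
the class distribution of a single node. 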



5. Semantical aspects 

In the previous section, which was only concerned with a probabilistic context, 
we were not immediately interested in giving a meaning to the fuzziness of the 
split but, first of all, in statistical considerations: looking for more robust 
splits, which are suitable for noisy contexts and well adapted to a given data 
configuration. As remarked by (Yuan et al. 95), decision trees developed 
within fuzzy logic theory are above all interested in another kind of 
uncertainty, called cognitive uncertainty. This uncertainty “deals with 
phenomena arising from human thinking, reasoning, cognition and perception 
processes”. This kind of uncertainty has generally been employed to model and 
to give a clear interpretation to the fuzzy splits. Such splits are especially 
employed in order to represent the notion of overlapping between the linguistic 
terms used by an expert. 

In comparison, the assignment process in a probabilistic context (resulting 
either from a soft split or an imprecise observation) relies on a clearly 
distinct idea: even if it also leads to the calculation of membership degrees of 
an object to the left or right nodes, it is assumed that the object actually 
belongs to a single node; but because of the uncertainty affecting the 
observation (or the split), this membership cannot be computed in a 
deterministic way. 

About validation. Even if bridges between the different axiomatics of 
uncertainty have been proposed (see for instance (Drakopoulos 95)), the question 
of estimating the predictive performance of a decision tree has not been as 
much studied as in the statistical domain (validation is often left to the 
expert). Many works interested in that aspect rely on the use of fuzzy 
probabilities, which make it possible to employ the classical results of 
Bayesian decision theory. 



6. Discussion: symbolic data analysis and decision trees 

Symbolic data analysis offers a quite suitable framework in which to study the 
problem of recursive partition (or decision tree induction, with respect to the 
terminology used in supervised learning) in the case of more complex data. 
The main reasons why the formalism of symbolic objects is particularly 
convenient may be summarized as follows: 

(i) Any decision tree structure may be described by an "organized" list of
symbolic objects. Indeed, each terminal node built by the recursive
partition may be formally described by a symbolic object, called an assertion,
that is, a piece of knowledge about the initial data set. The sub-population
associated with this node corresponds to the extension of the symbolic object,
while its description (the conjunction of successive properties that lead
to the node) forms its intension; for example, the description of a leaf t
is usually written as

$$s_t = \bigwedge_{j \in J_t} [Y_j = V_t^j] \qquad (5)$$

where $J_t$ is the subset of the predictors $Y_j$ that characterise the leaf t,
and $V_t^j$ is the description attached to $Y_j$.

(ii) Its logical representation makes it possible to represent, in a natural
way, the different sub-populations obtained through an iterative process (note
that the various terminal nodes are not necessarily characterised by the same
predictors).

(iii) Each assertion $s_t$ characterising a sub-population is a conjunction of
properties (coming from the binary questions) which are, in general, based
on multivalued descriptions $V_t^j$; this kind of description is well adapted
to represent the natural variability inside a population of objects.



Because the language of symbolic objects also makes it possible to represent
the initial data set (including for instance imprecise descriptions with the
help of probabilistic descriptions), the tree growing procedure may be
presented in the following general way: it consists in searching iteratively
for an organized set of symbolic objects (of kind (5)) which, at each step,
summarizes "as well as possible" the first list of symbolic objects (the
initial data). To do that, one should of course define a suitable measure of
affinity between two sets of symbolic objects, depending principally on the
kind of descriptions that are studied. For example, in (Ciampi et al. 96) and
(Perinel 96), because the aim was to summarize a set of probabilistically
imprecise descriptions, the affinity measure was built as an extension of the
classical statistical information measure.

The main advantage of presenting the recursive partition problem as above
(that is, more from a knowledge extraction point of view) is that it makes it
possible to take more complex descriptions into account: on the one hand, in
order to describe the sub-populations built by the partitioning (here, these
can be obtained on the basis of probabilistic soft splits); on the other hand,
in order to take into account complex data (described as in (2)) expressing for
instance the notions of noise, doubt or imprecision.

Abstract: Galois lattices are a kind of conceptual classification. Galois
lattice theory and construction have been extended from binary tables to
multivalued, interval and histogram data with missing values via the
symbolic object formalism. The Galois lattice of all "complete symbolic objects"
is a possible output, but is often not legible. First we propose the reduction
of the lattice to a pyramid or a hierarchy using numerical and symbolic
criteria. Then we propose to transform the lattice into an identification tree.
The last proposed solution is the extension of Duquenne's minimal set of rules.

Keywords: Galois Lattice, Symbolic Objects, Reduction and Interpretation

This article is organised as follows: first, we introduce Symbolic Data
Analysis; then, we recall the principle of Galois lattices of complex data;
finally, we present some possible reductions of symbolic Galois lattices.

1. Complex data table formalism 

The extension of standard data analysis methods to complex data is called
Symbolic Data Analysis [Diday 91]. Symbolic objects help to represent and
analyze real data, and allow us to treat directly complex tables in which any
cell may contain a disjunction of values, an interval of values, a histogram of
frequencies, or be empty.

1.1 Individuals 

The description of an individual $\omega$ is represented by
$\omega_s = (y_1(\omega), \ldots, y_p(\omega))$,
where each variable $y_i$, $i = 1, \ldots, p$, is a mapping from $\Omega$ to
$P(O_i)$, where $\Omega$ is the set of individuals, $O_i$ the description space
of the variable $y_i$, and $P(O_i)$ the power set of $O_i$. The variable
$y_i(\omega)$ may be quantitative (discrete or continuous) or qualitative
(nominal) and may be represented as a set of values, an interval of values, or
a histogram of values.

1.2 Classes

Symbolic objects help to formalize classes.

Let C be the set of all possible descriptions: $C = P(O_1) \times \ldots \times P(O_p)$.
For $i = 1, \ldots, p$, we define a mapping on descriptions:

$$d_i : C \to P(O_i), \qquad c \mapsto d_i(c)$$

where $d_i(c)$ may be quantitative (discrete or continuous) or qualitative
(nominal) and represents the description as a set of values, an interval or a
histogram of values.

Let R be a relation between three sets: the object set, the description
set and the variable set. This relation is defined by the operators
$R \in \{=, \subseteq, \supseteq, \in, \ni, \ldots\}$. The expression
$y_i(\omega) \, R \, d_i(c)$ is a predicate which concerns the variable $y_i$,
the individual $\omega$ and the description c, and R is one of the
previous operators. This predicate is true if the relation is true.

Definition 1 A symbolic object is a triplet (a, R, d) where R is a relation, d
is a description, and a is a mapping linking the individual, the relation
and the description.

Different classes of symbolic objects exist; we recall here the basic
definition of an assertion object.

Definition 2 Let R and d be fixed; an assertion object a is a symbolic object
completely defined by the following mapping:

$$a : \Omega \to \{true, false\}$$

$$\omega \mapsto a(\omega) = [y_1(\omega) \, R \, d_1] \wedge \ldots \wedge [y_p(\omega) \, R \, d_p]$$

such that $a(\omega) = true$ if and only if $\forall i = 1, \ldots, p$, $y_i(\omega) \, R \, d_i$.

Definition 3 The extent of a on $\Omega$ is defined as:

$$ext_\Omega(a) = \{\omega \in \Omega \mid a(\omega) = true\} = a^{-1}(true)$$
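Definitions 2 and 3 can be illustrated with a minimal sketch, assuming interval-valued variables and the relation "the observed interval contains the description interval". The helper names (`contains`, `make_assertion`, `extent`) are hypothetical and chosen for illustration only; the four individuals reuse the interval values of Table 1 of this paper.

```python
from typing import Callable, Dict, Set, Tuple

Interval = Tuple[float, float]
Individual = Dict[str, Interval]


def contains(value: Interval, description: Interval) -> bool:
    """Relation R: the observed interval contains the description interval."""
    return value[0] <= description[0] and description[1] <= value[1]


def make_assertion(
    descriptions: Dict[str, Interval],
    relation: Callable[[Interval, Interval], bool],
) -> Callable[[Individual], bool]:
    """Build a(w) as the conjunction of the predicates [y_i(w) R d_i]."""
    def assertion(individual: Individual) -> bool:
        return all(relation(individual[var], d) for var, d in descriptions.items())
    return assertion


def extent(assertion: Callable[[Individual], bool],
           population: Dict[int, Individual]) -> Set[int]:
    """Extent of an assertion: the individuals for which it is true."""
    return {name for name, ind in population.items() if assertion(ind)}


# Interval data for four individuals (values as in Table 1).
population = {
    1: {"y1": (0.75, 0.8), "y2": (1, 2)},
    2: {"y1": (0.6, 0.8), "y2": (1, 1)},
    3: {"y1": (0.5, 0.7), "y2": (2, 2)},
    4: {"y1": (0.72, 0.73), "y2": (1, 2)},
}

# Illustrative assertion: y1 contains [0.72, 0.73] and y2 contains [1, 1].
a = make_assertion({"y1": (0.72, 0.73), "y2": (1, 1)}, contains)
ext = extent(a, population)
```

Passing the relation as a parameter keeps the sketch close to Definition 1, where the same description can be read under different operators R.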

2. Galois lattice of complex data 

Galois lattices help in many domains to represent, interpret and identify 
data [Wille 90, Duquenne & Guigues 86, Godin & Missaoui 94]. Classically, 
complex data are recoded before treatment. 

Galois lattice construction has been extended from the binary case to more
complex data by incremental and non-incremental algorithms [Polaillon &
Diday 96]. We obtain Galois lattices where each node is a "complete assertion
object" [Diday 91], which generalizes the intent of a binary concept to
the case of more complex data. Two kinds of Galois lattices can be obtained
depending on our aim: we may be interested in classes described by properties
satisfied by all individuals belonging to the class (intersection lattice) or
satisfied by at least one of them (union lattice).

          Var 1           Var 2
Ind.1     [0.75, 0.8]     [1, 2]
Ind.2     [0.6, 0.8]      [1, 1]
Ind.3     [0.5, 0.7]      [2, 2]
Ind.4     [0.72, 0.73]    [1, 2]

Table 1: Example of data

Galois lattices have the following advantages: we obtain all possible
"complete assertion objects", each node is automatically described by one of
them and by its extent, and the nodes are ordered by a
generalization/specialization relation.

We consider Table 1, with interval data, as an illustrative example, but the
algorithms work with other kinds of data (multivalued, histograms).

Figure 1 gives the intersection lattice: nodes are described by their
extent and new objects (found by considering inheritance).



Figure 1: Intersection lattice (top node {1, 2, 3, 4}, described by $a_1(\omega)$)



Corresponding objects, without inherited properties, are:
$a_1 = true$
$a_2 = [y_2 \supseteq [2, 2]]$
$a_3 = [y_2 \supseteq [1, 1]]$
$a_4 = [y_1 \supseteq [0.72, 0.73]]$
$a_5 = [y_2 \supseteq [1, 2]]$
$a_6 = [y_1 \supseteq [0.6, 0.7]]$
$a_7 = [y_1 \supseteq [0.5, 0.6]]$
$a_8 = [y_1 \supseteq [0.75, 0.8]]$
$a_9 = [y_1 \supseteq [0.7, 0.72]] \wedge [y_1 \supseteq [0.73, 0.75]]$

The complete object corresponding to each node is calculated as the
conjunction of the new object of the node and the objects inherited from its
predecessors. For example, the complete object for class {4} is
$a_4 \wedge a_5 \wedge a_3 \wedge a_2$.

Figure 2 gives the union lattice: nodes are described by their extent and
their complete objects.



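Figure 2: Union lattice. Legible node labels:
top node {1, 2, 3, 4}: $a_1 = [y_1 \subseteq [0.5, 0.8]] \wedge [y_2 \subseteq [1, 2]]$
$a_5 = [y_1 \subseteq [0.5, 0.7]] \wedge [y_2 \subseteq [2, 2]]$
$a_6 = [y_1 \subseteq [0.72, 0.73]] \wedge [y_2 \subseteq [1, 2]]$
$a_7 = [y_1 \subseteq [0.75, 0.8]] \wedge [y_2 \subseteq [1, 2]]$
$a_8 = [y_1 \subseteq [0.6, 0.8]] \wedge [y_2 \subseteq [1, 1]]$
bottom node: $a_9 = [y_1 \subseteq \emptyset] \wedge [y_2 \subseteq \emptyset]$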



3. Galois lattice interpretation 

Data analysis with Galois lattices leads to two problems: the size of
Galois lattices grows with the number of individuals and attributes, and
there is no automatic aid for a non-expert to interpret Galois lattices. So
we have to simplify the representation of Galois lattices. [Godin & Missaoui
94, Guenoche 93] are interested in reducing lattices to binary trees, and
[Duquenne & Guigues 86] are interested in extracting a minimal set of rules.

3.1 Reduction via pyramid, hierarchies 

Galois lattices are constructed from mathematical theorems, but the
meaning of the data is not taken into account.

We compare two kinds of algorithms, which are applied directly to the
two lattices (intersection and union lattices):

• The first algorithm is based on a successor/predecessor similarity (it
was first proposed by [Bournaud 96]):

To obtain a hierarchy from a Galois lattice, for each node (starting
with specific complete objects), we choose the "nearest" predecessor
according to a distance and we suppress the links with the other predecessors.
At the end, we suppress the nodes which are not linked to at least one
individual.

• The second algorithm is an extension of hierarchical classification (first
proposed by [Guenoche 93]) and of pyramidal classification:

To obtain a hierarchy: we start with specific complete objects; while
objects are active, we look for the most similar pair of objects, which
are deactivated; we find their common parent, which is activated; and
links are updated.

To obtain a pyramid: the algorithm is almost the same. The differences
are that objects can be associated twice with other objects and that order
constraints must be respected (see the initial algorithm of [Bertrand &
Diday 90]).
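The first algorithm (nearest predecessor, then pruning) can be sketched as follows. This is only an illustration under stated assumptions: the lattice is given as a hypothetical dictionary mapping each node to its predecessors, the distance is supplied by the caller (the distance defined in this paper is not reproduced here), and "linked to at least one individual" is read as "reachable upwards from an individual through the kept links". The toy lattice below is invented for the example and is not the lattice of Figure 1.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set

Node = Hashable


def lattice_to_hierarchy(
    predecessors: Dict[Node, List[Node]],
    individuals: Iterable[Node],
    distance: Callable[[Node, Node], float],
) -> Dict[Node, Node]:
    """Keep, for every node, only the link to its nearest predecessor."""
    kept: Dict[Node, Node] = {}
    for node, preds in predecessors.items():
        if preds:  # the top node has no predecessor
            kept[node] = min(preds, key=lambda p: distance(node, p))

    # Walk upwards from each individual; nodes never visited are dropped.
    linked: Set[Node] = set()
    for start in individuals:
        current = start
        while current is not None and current not in linked:
            linked.add(current)
            current = kept.get(current)
    return {n: p for n, p in kept.items() if n in linked}


# Hypothetical toy lattice and distance table (illustration only).
preds = {"top": [], "A": ["top"], "B": ["top"], "C": ["top"],
         "x": ["A", "C"], "y": ["B"]}
dist_table = {("x", "A"): 0.2, ("x", "C"): 0.9, ("y", "B"): 0.1,
              ("A", "top"): 0.5, ("B", "top"): 0.5, ("C", "top"): 0.5}
hierarchy = lattice_to_hierarchy(
    preds, ["x", "y"], lambda node, pred: dist_table[(node, pred)]
)
```

In this toy case the node C loses its only successor link and is therefore removed, while every remaining node keeps exactly one predecessor, which is what makes the result a tree.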

Many distances between symbolic objects can be found in the literature,
but our setting has specificities that they never take into account. For
example, we cannot choose the same distance in the union and intersection
cases, missing values have a particular interpretation, the "nearest"
predecessor is the one which fits best, and we have to take into account the
order induced by the lattice. We define a new distance, which includes
numerical and symbolic criteria. The numerical criterion is based on the
[Gowda & Diday 94] dissimilarity, modified by changing the position coefficient
(we take into account the upper bound, otherwise similar situations can lead to
different results) and by including the case of missing values. The symbolic
criterion takes into account the lattice order by the use of the distance
proposed by [Leclerc 94].

We obtain a hierarchy or a pyramid without choosing an aggregation
criterion.

With the previous example, from the intersection lattice, we obtain Figure 3.

3.2 Reduction via identification tree 

It is a descending approach, which differs from the divisive approach in that
we work on the Galois lattice. Many discriminant criteria can be found in
the literature [Guenoche 93, Ziani 96]. The advantages of our method are
that we work with complete objects (we have all robust descriptions) and
that we already have all possible variable splits (we do not need other
criteria for splitting each variable domain and choosing the best split).

Figure 3: Trees obtained with ascendant approach (panels: hierarchy by first
algorithm, hierarchy by second algorithm, pyramid by second algorithm; each is
rooted at {1, 2, 3, 4} with leaves {1}, {2}, {3}, {4})

We compare two algorithms:

• The first algorithm is close to the divisive approach:
Starting at the top of the lattice, given an optimality criterion, for each
successor node, we choose those which satisfy the criterion, and we
continue until the bottom nodes are reached.

• The second algorithm is quite intuitive and is inspired by neural networks:
Starting at the top of the lattice, we choose the successor nodes where
individuals have the greatest probabilities to go, and we continue until the
bottom nodes are reached. This method requires a bootstrap validation.

We directly obtain an identification tree which works on all variables at
the same time, ignoring a priori classes, and whose nodes are complete
assertion objects.

The previous example is too small to apply the first algorithm, but with
the second algorithm we obtain Figure 4.



Figure 4: Trees obtained with the probabilistic descending approach (panels:
union "tree", intersection "tree")



3.3 Duquenne’s rules 

The reduction and the interpretation of the lattice by a minimal set of rules
between attributes in the binary case have been proposed by [Duquenne &
Guigues 86].

In the case of multivalued data, interval data and histogram data, the
extension of rule extraction is straightforward, as we work on inclusion of
extents.

The property used in the proof is immediately extended: the
implication $A \to B$ is true if and only if $A' \subseteq B'$, where A and B
denote sets of attributes, and A' and B' denote the sets of objects having A
and B. In the case of complex data, A and B denote sets of attribute values,
and A' and B' are the extents of the objects composed of the descriptions A and
B, with relation $R = \supseteq$. We immediately have
the same property. We can use Duquenne's inference rules (B, C, D denote
attributes) in order to find the minimal set of rules:

• $B \to D$ is a consequence of $B \to C$ and $C \to D$

• $B \cup X \to D \cup X$ is a consequence of $B \to D$

• $B \to D$ is true when $B \supseteq D$

From a complex table, we directly obtain rules between complex data.

By treating the complete intersection lattice of Figure 1, we obtain the
following rules:
$a_8 \to a_3$
$a_4 \to a_3$
$a_9 \to a_4, a_6$
$a_3, a_6 \to a_4, a_8$
$a_4, a_8 \to a_9$
$a_7 \to a_2, a_6$
$a_2, a_6 \to a_7$
$a_5 \to a_2, a_3$
$a_2, a_3 \to a_5$
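The inclusion-of-extents property gives a direct way to check such rules on the data of Table 1. The sketch below is an illustration, not the rule extraction procedure itself: it assumes that every elementary object reads "the observed interval contains the given interval" (relation ⊇), and the names `TABLE_1`, `OBJECTS`, `RULES`, `extent`, `common_extent` and `rule_holds` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[float, float]

# Interval data of Table 1.
TABLE_1: Dict[int, Dict[str, Interval]] = {
    1: {"y1": (0.75, 0.8), "y2": (1, 2)},
    2: {"y1": (0.6, 0.8), "y2": (1, 1)},
    3: {"y1": (0.5, 0.7), "y2": (2, 2)},
    4: {"y1": (0.72, 0.73), "y2": (1, 2)},
}

# Objects of the intersection lattice, as (variable, interval) predicates.
OBJECTS: Dict[str, List[Tuple[str, Interval]]] = {
    "a2": [("y2", (2, 2))],
    "a3": [("y2", (1, 1))],
    "a4": [("y1", (0.72, 0.73))],
    "a5": [("y2", (1, 2))],
    "a6": [("y1", (0.6, 0.7))],
    "a7": [("y1", (0.5, 0.6))],
    "a8": [("y1", (0.75, 0.8))],
    "a9": [("y1", (0.7, 0.72)), ("y1", (0.73, 0.75))],
}

# Rules as (premise, conclusion) pairs of object names.
RULES: List[Tuple[List[str], List[str]]] = [
    (["a8"], ["a3"]), (["a4"], ["a3"]), (["a9"], ["a4", "a6"]),
    (["a3", "a6"], ["a4", "a8"]), (["a4", "a8"], ["a9"]),
    (["a7"], ["a2", "a6"]), (["a2", "a6"], ["a7"]),
    (["a5"], ["a2", "a3"]), (["a2", "a3"], ["a5"]),
]


def extent(name: str) -> Set[int]:
    """Individuals whose intervals contain every interval of the object."""
    return {
        ind for ind, row in TABLE_1.items()
        if all(row[var][0] <= lo and hi <= row[var][1]
               for var, (lo, hi) in OBJECTS[name])
    }


def common_extent(names: List[str]) -> Set[int]:
    """Extent of a conjunction of objects: intersection of their extents."""
    result = set(TABLE_1)
    for name in names:
        result &= extent(name)
    return result


def rule_holds(premise: List[str], conclusion: List[str]) -> bool:
    """A -> B holds if and only if the extent of A is included in that of B."""
    return common_extent(premise) <= common_extent(conclusion)


checked = {(tuple(p), tuple(c)): rule_holds(p, c) for p, c in RULES}
```

Under the stated reading of the objects, all nine rules satisfy the inclusion test on the four individuals of Table 1.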

4. Conclusion 

We treat complex data tables directly. Galois lattices give us all "complete
assertion objects", which represent a robust mathematical space to work on.
By taking into account different numerical criteria (dissimilarities on complex
objects, discriminant distances...) and lattice properties, it is possible to
extract a hierarchy, a pyramid or an identification tree from Galois lattices.
With classical criteria, we find the same results as in the literature, but
the difference is that we are sure to obtain robust complete objects. By
taking into account properties of the lattice, it is possible to summarize the
Galois lattice directly as a minimal set of rules between complex attributes.
The perspective of this work is to validate all the approaches, possibly by
comparing them to each other.

Abstract: The analysis of object-oriented datasets is a challenging task since
significant differences exist between the usual data formats and software
objects. An approach towards the definition of an object proximity measure is
presented here which fits the basic principles of the object paradigm.

1. Introduction

The ever growing demand of analysis for complex real-world datasets (Diday 
87) has put the emphasis on issues related to the integration between data 
sources (e.g. databases or knowledge bases) and analysis tools (Piatetsky- 
Shapiro et al 91). Our own study focuses on the application of automatic 
clustering within object knowledge bases (KB). An object base constitutes a 
model of a real-world domain where domain relevant knowledge is expressed 
through objects organized into class taxonomies. Clustering may be helpful for 
the task of class building when only an unstructured set of objects is available. 
It requires a consistent proximity function for objects, i.e. one which processes 
objects within their original context, the KB, and with respect to all the relevant 
information contained in the KB. 

In this paper we discuss the features of the object formalism which make 
two objects similar and provide effective means for measuring the induced 
proximities. Moreover we address the problem of combining the contribution of 
the different object aspects into a global object proximity measure. The 
described function is complete, universal and flexible, and therefore enables 
sensible application of classical clustering techniques on objects. 



2. Object based knowledge representation 

Object formalisms concentrate the knowledge about a particular element of the
modeled domain into a single expression, a structured object. Moreover, they
distinguish between individual knowledge, expressed as objects, and generic
knowledge, modeled through object classes. In the following, to illustrate the
object paradigm we provide an example of a KB modeling a real-estate domain.
The KB uses the description language of TROPES (Sherpa project 95), an object
representation system developed by our research team.

2.1. Objects 

A structured object may be seen as a list of valued fields (attributes) where each
field embodies a particular aspect of the modeled element. For example, an
object representing a flat will have fields like rent, standing, rooms, owner, etc.
In TROPES, objects are divided into disjoint families, called concepts (e.g. flat,
human, room). A concept defines the structure, i.e. the set of fields, of the
objects belonging to it, its instances: the instances identifiable as flat#5 and
flat#18 both have a rent, an owner, etc. Fields model features of three kinds:
properties are the element's own features (e.g. rent, standing), components
introduce element parts which are objects themselves (e.g. a flat and its
rooms), and links model relations between elements other than composition (e.g.
ownership in the case of the owner field). TROPES uses strong typing: each field
is assigned a type, i.e. a set of admissible values and operations. Property
fields are typed by simple data types: here rent and age are both of integer
type. Relational fields are typed by concepts: in the real-estate base, rooms
has the room concept as its type and owner has human. Finally, multi-valued
fields of two kinds, sets and lists, are admitted. Thus, rooms is a set-valued
attribute; for example, it relates flat#5 to the respective set of room
instances (noted flat#5.rooms = {room#24, room#31, room#6}).

Figure 1: Example of two sets of related objects corresponding to flat#5 and
flat#18: flat owners, human#36 and human#62, and room sets are included.
Relational fields are drawn as labeled arrows pointing at the target object
whereas properties are directly cited.




2.2. Classes 

Classes describe sets of objects (in TROPES, subsets of concept
instances) intensionally, i.e. by providing information about object fields: types
and value constraints. Field constraints on properties basically define
sub-domains of the field domain fixed by the concept. Imagine that kitchen
restricts the surface field, originally fixed in room to a positive integer, by:
kitchen.surface = [9, 15] (read: the surface of a kitchen is between 9 and 15
m²). Relational fields are constrained by providing a class of the type concept.
The set of all constraints constitutes a necessary condition for membership in a
class: instance field values must satisfy all those restrictions. Thus, room#24
and room#45, being members of kitchen within the room concept (see Fig. 2),
both satisfy the constraint on surface. Classes of a concept are organized into a
taxonomy by specialization, i.e. an a-kind-of relation distinguishing a
super-class and a sub-class. Specialization implies both set inclusion of class
extensions and growing strength of value restrictions. An object is attached to a
single class (is-a relation or instantiation), but is a member of all its
super-classes. This means room#24 and room#45 are instances of the service class as
well. In sum, a KB should be seen as a set of inter-related objects described by
fields and belonging to various classes. Moreover, classes express taxonomic
knowledge about their instances which is complementary to object descriptions.



Figure 2: Example of object instantiation: the taxonomy of the room concept is
drawn with the attachment class for each room.




2.3. Analysis issues 

Classes group objects of identical structure and with field values lying within
restricted domains, that is, class instances are similar in some way. Therefore,
given an unstructured set of objects, when homogeneous object clusters appear
within it one may expect those clusters to represent classes. The main issue is
thus to formally state a criterion for object similarity with respect to the
features of the object paradigm.

Objects are thoroughly characterized by the set of their field values, simple 
values or other objects, and by the classes they belong to. With respect to what 
has been said previously, measuring a proximity between a couple of objects 
requires effective evaluation means for elementary proximities on properties, 
components, links and classes. 

Besides, whereas properties are local to an object, the relational fields introduce
dependencies between objects (e.g. between a flat and its owner). Actually, the
set of all objects related to a given object by chains of relational attributes may
be seen as a network (see Fig. 1 for an example). Both nodes and arcs in the
network are labeled: nodes by the concept and, possibly, by an attachment
class, and arcs by the field name. The network represents the smallest
context in which an object appears. Therefore, a sensible proximity measure for
objects should combine the contribution of each element of the respective
networks. Note that what is needed is a proximity for nodes rather than
for entire networks. In the next section we present an object proximity which
computes contributions of related objects with respect to the length of the chain
linking them to the compared objects.



3. Object proximity measure 

Let C be a concept, e.g. flat, with n fields $a_1, \ldots, a_n$, and let
$dom(a_i)$ denote the domain and $type(a_i)$ the type of a field $a_i$. For a
couple of concept instances $o$ and $o'$, let $o.a_i = v$ and $o'.a_i = v'$
denote the respective field values and $\delta_i^f$ the field-specific
proximity function. Notice that all real-valued functions below are given in a
dissimilarity form and normalized.

3.1. Elementary proximities 

In the case of a property with values $v$ and $v'$, let $type(a_i) = T$. The
values might be compared in a way specific to T (e.g. by number subtraction).
However, in systems like TROPES, non-standard and external data types may be
used (e.g. Date, Chemical element, Nucleic acid, etc.) and all types are
treated equally. This requires a generic comparison: we assume two values to be
as similar as a type expression generalizing both (e.g. an interval for a
couple of numeric values) is specific. Thus, let $\tau$ be the most specific
generalization of $v$ and $v'$; then their proximity is the relative size,
noted $\| \cdot \|$, of $\tau$:

$$\delta^T(v, v') = \frac{\|\tau\|}{\|dom(a_i)\|}$$

In Fig. 1, the difference between the rents of both flats is
$\delta^T(2400, 3200) = 0.17$ with $\|dom(rent)\| = 4800$ and $\tau = [2400, 3200]$.
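For numeric properties this relative-size proximity reduces to a short computation. The sketch below assumes that the most specific generalization of two numbers is the closed interval spanning them and that its size is its length; the function names are hypothetical.

```python
from typing import Tuple


def least_interval(v: float, w: float) -> Tuple[float, float]:
    """Most specific interval generalizing two numeric values."""
    return (min(v, w), max(v, w))


def property_dissimilarity(v: float, w: float, domain_size: float) -> float:
    """Size of the generalizing interval relative to the size of the domain."""
    low, high = least_interval(v, w)
    return (high - low) / domain_size


# Rent example: the interval [2400, 3200] within a domain of size 4800.
rent_term = property_dissimilarity(2400, 3200, 4800)
```

Here `rent_term` equals 800 / 4800, which rounds to the value 0.17 quoted for the two rents.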

For a relational field $a_i$, $v$ and $v'$ are instances of the concept
$C' = type(a_i)$. The more similar those objects, the greater the proximity of
$o$ and $o'$. Both for link and component fields, the proximity of the referred
objects within their own concept (see next subsection) is substituted for
$\delta^T$: $\delta^{C'}(v, v') = d_{C'}(v, v')$.

Still another aspect of object comparison is the processing of multi-valued
attributes $a_i$ (e.g. the rooms attribute). In this case $v$ and $v'$ are
collections of elements for which a proximity $\delta$ is assumed available. As
collections may have variable length, one needs a matching between their
members. The matching should, in addition, optimize the sum of the pair-wise
proximities $\delta$ of the matched collection members. The exact procedure for
finding an optimal matching $M_{opt}(v, v')$ for sets and lists may be found in
(Valtchev et al 97) and is omitted here for space reasons. The collection
proximity is the ratio of the total proximity of the matching (possibly
corrected by the unmatched elements) to the maximal collection size:

$$\delta^{coll}(v, v', \delta, M_{opt}(v, v')) = \frac{\sum_{(e, e') \in M_{opt}(v, v')} \delta(e, e')}{\max(l, l')}$$

where $l$ and $l'$ are the sizes of $v$ and $v'$ respectively. For example, an
optimal matching for flat#5.rooms and flat#18.rooms with $\delta$ equal to
$\delta^H_{room}$ is $M_{opt}$ = {(room#24, room#45), (room#31, room#23),
(room#6, room#2)} with $\delta^{coll}$ of 0.5 (room#17 is unmatched).

* In (Valtchev et al 96) a slightly different definition based on graph distance is given.

The proximity of a couple of objects $o$ and $o'$ with respect to their
attachment classes $c$ and $c'$ may be evaluated according to the same
principle of least generalization. Thus, $c$ and $c'$ are as similar as their
most specific common super-class $\bar{c}$ within the taxonomy $H$ of C is
specific. The effective measure $\delta^H(o, o')$, described in (Valtchev et
al 96), is based on the graph distance: its value is proportional to the length
of the shortest path linking $o$ and $o'$ in the taxonomy $H$ or, equivalently,
to the total number of intermediate classes between each of the objects and
$\bar{c}$. For example, $\delta^H$(room#6, room#2) = 0.33 since both objects
share the attachment class bedroom, whereas $\delta^H$(room#17, room#2) = 0.66
with respect to the common super-class basic (see Fig. 2).

3.2. Object-level function 

Suppose the field functions cons and nat return the constructor (one for single- 
valued fields, list or set) and the nature (property, component or link) of 
an object field. Notice that all real-valued functions below are normalized. 

Given a concept C, the object dissimilarity function $d_C : C \times C \to [0, 1]$
is an additive function of the class dissimilarity $\delta^H$ and the
elementary proximities $\delta_i^f$ on each field $a_i$. For a couple of
objects $o$ and $o'$,

$$d_C(o, o') = \lambda_0 \cdot \delta^H(o, o') + \sum_{i=1}^{n} \lambda_i \cdot \delta_i^f(o.a_i, o'.a_i) \qquad (1)$$

where the $\lambda_i$ are the user-specified field weights ($\sum_{i=0}^{n} \lambda_i = 1$).
Note that if no taxonomy is available on C then $\delta^H$ and $\lambda_0$ are
set to 0. Next, $\delta_i^f$ on a field $a_i$ depends on its constructor:

$$\delta_i^f(v, v') = \begin{cases} \delta(v, v') & cons(a_i) = one \\ \delta^{coll}(v, v', \delta, M_{opt}(v, v')) & cons(a_i) \in \{set, list\} \end{cases}$$

Here $\delta^{coll}$ is a collection dissimilarity and $\delta$ is a generic
function further instantiated with respect to the field nature:

$$\delta(v, v') = \begin{cases} \delta^T(v, v') & nat(a_i) = property,\ type(a_i) = T \\ d_{C'}(v, v') & nat(a_i) \in \{link, component\},\ type(a_i) = C' \end{cases}$$

Here $\delta^T$ is the elementary proximity function on T whereas $d_{C'}$ is
the (also normalized) object proximity on concept C' as defined by (1).

3.3. Computation issues 

Given a couple of objects $o$ and $o'$, $d_C$ may only be computed if all the
values of $\delta_i^f$ in (1) are known. Property and component proximities can
always be calculated prior to the object proximity computation. In contrast,
links may introduce circuits in the networks, thus making $d_C(o, o')$
recursively depend on itself. For example, suppose that for a link $a_i$ there
is a field $a'_j$ in $C' = type(a_i)$ describing the inverse relation, i.e. for
all $e$ in $C'$, $o.a_i = e$ iff $e.a'_j = o$. The couple $a_i$, $a'_j$ then
establishes a cyclic dependency between $d_C$ and $d_{C'}$. In the
real-estate KB, the fields owner and house constitute such a couple between
flat and human. Cycles may occur where more than two objects are involved.
To find the effective values of the pair-wise proximities, the formula in (1) may
be seen as an equation where those proximities are variables (Bisson 95). Thus,
collecting all the formulas for the object couples which take part in circuits
within the networks of $o$ and $o'$, we obtain a system of equations. For each
equation, the coefficients are field weights and the variables are object
proximities for couples of objects which are related to $o$ and $o'$ by the same
chain of links. The right-hand side is the local part of the respective
proximity, which is made up of the known values: property, component and class
proximities (if any). The obtained system is square (m variables and m
equations) and linear provided no multi-valued links exist in a circuit. In this
case, the system matrix is diagonally dominant, so the system has a unique
solution, which can be found by an iterative method. In the case of
multi-valued links, all the equations remain linear, but non-determinism is
introduced by the matching in the computation of $\delta^{coll}$. This
amounts to solving several linear equation systems in parallel, one per
matching, which would be too expensive. Instead, a heuristic may be used to
make the iterative method fit the new situation: a matching could be carried out
after each iteration (for details the reader is referred to (Bisson 95)).

4. Example

Suppose one has to compute the proximity of the flats in Fig. 1. As indicated,
there is a circuit between flat#5 and human#36 on the one hand and flat#18
and human#62 on the other hand. First, we assign equal weights to the field
functions in all three concepts. Thus, age and house are weighted 0.5 in human
(no taxonomy is given), rent, owner and rooms are weighted 0.33 in flat, whereas
surface is weighted 0.5 in room because the class dissimilarity $\delta^H$ is
weighted 0.5 too. Next, let the ranges for the fields age, surface and rent be
80, 20 and 4800 respectively. The object proximity $d_{room}$ may be computed
independently from the linear system. For example,
$d_{room}$(room#24, room#45) = 0.5 · (0.1 + 0.33) = 0.22.
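The additive form of the object-level function can be written as a small weighted sum. The sketch below is an illustration with a hypothetical function name; it assumes that, in the room example, 0.33 is the class term and 0.1 the surface term, which the text does not state explicitly.

```python
from typing import Sequence


def object_dissimilarity(
    class_term: float,
    field_terms: Sequence[float],
    class_weight: float,
    field_weights: Sequence[float],
) -> float:
    """Weighted sum of the class term and of the field terms."""
    if len(field_terms) != len(field_weights):
        raise ValueError("one weight is needed per field term")
    if abs(class_weight + sum(field_weights) - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    weighted_fields = sum(w * t for w, t in zip(field_weights, field_terms))
    return class_weight * class_term + weighted_fields


# Room example: class term 0.33 (assumed) and surface term 0.1 (assumed),
# each weighted 0.5.
d_room_24_45 = object_dissimilarity(0.33, [0.1], 0.5, [0.5])
```

With these inputs the sum is 0.215, which the text reports rounded to 0.22.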



Table 1: Values of $d_{room}$

$d_{room}$    room#45    room#23    room#2    room#17
room#24       0.22       0.43       0.63      0.7
room#31       0.5        0.19       0.75      0.83
room#6        0.65       0.8        0.24      0.5

Using the above values for $d_{room}$ we obtain the same optimal matching between
room collections, $M_{opt}$, as in the previous section. This time, however, the
collection proximity is 0.45.
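The optimal matching on the values of Table 1 can be checked with an exhaustive search. This is a brute-force sketch that is only practical for very small collections; it is not the procedure of (Valtchev et al 97), the names are hypothetical, and the correction for unmatched elements is deliberately left out because it is not specified here.

```python
from itertools import permutations
from typing import Dict, Tuple

# Pair-wise room dissimilarities from Table 1 (rows: flat#5, columns: flat#18).
D_ROOM: Dict[str, Dict[str, float]] = {
    "room#24": {"room#45": 0.22, "room#23": 0.43, "room#2": 0.63, "room#17": 0.7},
    "room#31": {"room#45": 0.5, "room#23": 0.19, "room#2": 0.75, "room#17": 0.83},
    "room#6": {"room#45": 0.65, "room#23": 0.8, "room#2": 0.24, "room#17": 0.5},
}


def optimal_matching(
    dissim: Dict[str, Dict[str, float]]
) -> Tuple[Dict[str, str], float]:
    """Try every injective assignment of rows to columns; keep the cheapest.

    Assumes there are at least as many columns as rows.
    """
    rows = list(dissim)
    cols = list(next(iter(dissim.values())))
    best_pairs: Dict[str, str] = {}
    best_total = float("inf")
    for chosen in permutations(cols, len(rows)):
        total = sum(dissim[r][c] for r, c in zip(rows, chosen))
        if total < best_total:
            best_pairs, best_total = dict(zip(rows, chosen)), total
    return best_pairs, best_total


matching, matched_total = optimal_matching(D_ROOM)
```

The search returns the pairs (room#24, room#45), (room#31, room#23) and (room#6, room#2), leaving room#17 unmatched, in agreement with the matching stated in the text.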

Finally, the proximity of the owners according to their age is
$\delta_{age}^f(31, 55) = \delta^{integer}(31, 55) = 0.3$ and the proximity of the flat rents is
$\delta_{rent}^f(2400, 3200) = \delta^{integer}(2400, 3200) = 0.17$. The linear system can now be
composed with two variables: $x_1$ standing for $d_{flat}$(flat#5, flat#18) and
$x_2$ for $d_{human}$(human#36, human#62).

$$x_1 = 0.33 \cdot (x_2 + 0.45 + 0.17)$$

$$x_2 = 0.5 \cdot (x_1 + 0.3)$$

After resolution, we obtain $x_1 = 0.304$ and $x_2 = 0.302$.
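The two equations can be solved with a simple fixed-point (Jacobi-style) iteration, in the spirit of the iterative method mentioned in Section 3.3. The sketch below uses the rounded weights 0.33 and 0.5 exactly as they appear in the equations; the function name, the starting point and the tolerance are assumptions made for illustration.

```python
from typing import Tuple


def solve_flat_human_system(
    x1: float = 0.0,
    x2: float = 0.0,
    tol: float = 1e-12,
    max_iter: int = 1000,
) -> Tuple[float, float]:
    """Iterate x1 = 0.33*(x2 + 0.45 + 0.17), x2 = 0.5*(x1 + 0.3) to a fixed point."""
    for _ in range(max_iter):
        new_x1 = 0.33 * (x2 + 0.45 + 0.17)
        new_x2 = 0.5 * (x1 + 0.3)
        if abs(new_x1 - x1) < tol and abs(new_x2 - x2) < tol:
            return new_x1, new_x2
        x1, x2 = new_x1, new_x2
    raise RuntimeError("iteration did not converge")


x1, x2 = solve_flat_human_system()
```

With these rounded weights the iteration settles near $x_1 \approx 0.304$ and $x_2 \approx 0.302$; using the exact weight 1/3 instead of 0.33 would shift the values slightly, to about 0.308 and 0.304.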



5. Conclusion 



In this paper, the specific features of the object formalism have been discussed
with regard to their contribution to object proximity. The results of the
analysis have been used in the definition of an object proximity model. The
proposed measure presents the following advantages. First, it is complete with
respect to object description: potentially all the available knowledge about
object interpretations may be taken into account. Next, the function respects
the fundamental principles of the formalism, like hierarchical organisation,
data abstraction, object composition, etc. At the same time, the dissimilarity
is tunable, i.e. parameters can be set to adapt the function to a particular
domain and situation. The behavior of the function on datasets of complex
structure and greater size is still to be tested.

Abstract: The paper addresses the problem of analysing answers to an open
question observed in different waves of a repeated survey. Multiway data 
analysis techniques can offer interesting tools for doing that. Unfortunately, 
dealing with textual data some peculiar problems arise. The aim of this paper is 
to propose the use of non-symmetrical data analysis techniques in order to 
follow lexical behaviours with respect to a set of explanatory numerical 
variables, defining groups through time. Furthermore, attention is paid to the 
definition of a conjoint vocabulary. 

Keywords: Principal Matrices, Graphical Displays, Textual Forms.



1. Introduction 

One of the most common textual data analysis techniques consists in an
ordinary Correspondence Analysis performed on a lexical table (Lebart, Salem,
1994), built by cross-classifying a lexical variable with a categorical (or
categorised) variable. In a questionnaire, the answers to an open question are
generally cross-classified with the answers to a closed-ended question.
Balbi (1995) proposes the use of non-symmetrical correspondence analysis
(Lauro, D'Ambra, 1984) in order to emphasise the different roles played by the
two variables/questions. As a consequence of the chosen method, distances
among occurrences on a factorial plane are measured in a usual Euclidean
metric, which takes into account the marginal frequencies. Thus the identified
structure is strongly influenced by common words.

The aim of this paper is to extend the non-symmetrical approach to the case in
which we deal with open-ended questions in repeated surveys. The idea is to
distinguish groups of individuals (pseudo-panel) identified by time-invariant
categorical variables and to follow their behaviour in time, by displaying
trajectories on a principal plane obtained by means of a Principal Matrices
Analysis onto a Design Matrix (Balbi, Lauro, Scepi, 1994). One of the most
interesting problems in such an approach is the proper definition of the
vocabulary. It is essential to understand how words change their
characterisations through time, and textual forms have to be investigated in
depth. Tools useful in building a one-time vocabulary, such as regions defined
by bootstrap techniques (Balbi, 1995), can be applied in order to compare the
same form observed on different occasions.

After introducing the peculiarities of the data structure (Sec. 2), the
principal methodological tools used are presented in order to suggest a
strategy for analysing the answers to open questions in repeated surveys
(Sec. 3). The problem of defining a proper vocabulary is addressed in
Section 4. The paper ends with an application in a slightly different context:
the comparison of different advertisement campaigns in the Italian press
related to the life cycle of a product, the cellular telephone, from elitist
consumers to a mass market (Sec. 5).



2. The Data Structure 

In many countries there is a severe lack of panel surveys, while a notable
amount of information is available through repeated independent cross-sections.
Thus, the same questionnaire, usually consisting of $Q_1$ closed questions and
$Q_2$ ($\ll Q_1$) open questions, is submitted to $T$ different samples, not
necessarily of the same size, on $T$ different occasions. In the following we
consider the case in which $Q_2 = 1$, and the answers to this question have to
be analysed by means of a textual data analysis technique, instead of recoding
them a posteriori.

Thus the data structure at hand consists of $T$ matrices $X_1, \ldots, X_t,
\ldots, X_T$, each of dimension $(n_t, w_t + m)$ (with $t = 1, \ldots, T$). The
total number of individuals is $n = \sum_{t=1}^{T} n_t$. Each $X_t$ is
partitioned into two submatrices: the lexical matrix $V_t$ $(n_t, w_t)$,
individuals by the $w_t$ forms used in answering the open question, and an
indicator $(n_t, m)$ matrix $F_t$, containing the answers of the same
individuals to a subset $q_1$ of the $Q_1$ closed-ended questions in
disjunctive coding, with $m = \sum_{j=1}^{q_1} m_j$, $m_j$ being the number of
categories of the $j$-th closed question considered. For example, we can
consider two demographic variables: sex and class of age in three categories:
young, adult, aged, $F_t$ being an $(n_t, 6)$ matrix.

As the interest of the analysis usually consists in understanding the relation 
between vocabulary and sociodemographic characteristics of the individuals, we 
fix the latter as a structure common to all the occasions. If those variables are 
the stratification factors in the sampling plan, each wave of the survey can be 
seen as a replication of the experiment under the same conditions. This 
approach was proposed by Balbi, Lauro, Scepi (1994), dealing only with closed
questions, but hypothesising a dependence structure in the questionnaire.

Thus we transform each $X_t$ into an $A_t$ $(w_t, m)$ matrix, by aggregating
individuals according to their own condition, defined by a combination of the
socio-demographic variables. The three-way structure so obtained consists of
$T$ matrices $A_t$ $(w_t, m)$, suitable to be analysed by an interstructure -
compromise - intrastructure method, such as STATIS (Escoufier, 1987) or
Principal Matrices Analysis (Rizzi, 1989).

Such methods make it possible to build a reference structure in which to
compare the behaviour of one word through time, or to study how the groups of
individuals so defined change their vocabulary from one occasion to another.
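
The aggregation step can be sketched as follows. This is a minimal illustration, not the authors' software: the function name, the toy counts and the choice of identifying each group by a tuple of socio-demographic categories are assumptions made here.

```python
def aggregate_lexical_table(V, groups, group_order):
    """Return A (forms x groups): A[i][j] counts form i among individuals of group j.

    V           -- lexical matrix, one row per individual, one column per form
    groups      -- group label of each individual (same length as V)
    group_order -- list of group labels fixing the column order of A
    """
    n_forms = len(V[0])
    column = {g: j for j, g in enumerate(group_order)}
    A = [[0] * len(group_order) for _ in range(n_forms)]
    for row, g in zip(V, groups):
        j = column[g]
        for i, count in enumerate(row):
            A[i][j] += count
    return A


# Toy wave (hypothetical data): 4 individuals, 3 forms,
# groups defined by a combination of sex and age class.
V_t = [[1, 0, 2],
       [0, 1, 1],
       [2, 0, 0],
       [0, 3, 1]]
groups_t = [("F", "young"), ("M", "adult"), ("F", "young"), ("M", "adult")]
order = [("F", "young"), ("M", "adult")]

A_t = aggregate_lexical_table(V_t, groups_t, order)
print(A_t)
```

Using the same `group_order` for every wave is what gives all the aggregated tables the same columns.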



3. The Analysis 

As in one-time textual data analysis, each $A_t$ can be seen as a contingency
table, whose general cell contains the number of times the $i$-th word appears
in the answers given by the $j$-th group of individuals. In STATIS-like
methods (see, inter alia, Glaçon, 1981), contingency tables are transformed
into the corresponding Burt matrices, in order to compute, for each pair of
matrices, the RV coefficients (Robert, Escoufier, 1976), the general elements
of the interstructure matrix $S$.
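
As an illustration of the interstructure step, the sketch below computes RV coefficients between tables sharing the same rows, using the usual trace formula $RV = tr(W_s W_t) / \sqrt{tr(W_s^2)\, tr(W_t^2)}$. It is a simplified sketch under stated assumptions: the cross-product matrices $W_t = A_t A_t'$ are taken on the raw tables, without the centring or the Burt transformation mentioned above, and all function names are hypothetical.

```python
import math


def matmul(X, Y):
    """Plain matrix product of two lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def trace(X):
    return sum(X[i][i] for i in range(len(X)))


def rv_coefficient(A, B):
    """RV coefficient between two tables with the same rows."""
    WA = matmul(A, transpose(A))  # rows x rows cross-product matrix
    WB = matmul(B, transpose(B))
    num = trace(matmul(WA, WB))
    den = math.sqrt(trace(matmul(WA, WA)) * trace(matmul(WB, WB)))
    return num / den


def interstructure(tables):
    """Matrix S whose general element is the RV coefficient of a pair of tables."""
    return [[rv_coefficient(a, b) for b in tables] for a in tables]


# Toy tables (hypothetical counts): 3 forms x 2 groups on two occasions.
A_1 = [[2, 1], [0, 3], [1, 1]]
A_2 = [[3, 1], [1, 2], [0, 2]]
S = interstructure([A_1, A_2])
```

The diagonal of `S` is 1 by construction, and the matrix is symmetric.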

After the spectral decomposition of $S$, which makes it possible to represent
each $A_t$ as a point on a factorial plane, the compromise matrices $C_\alpha$
are built: $C_\alpha = \sum_{t=1}^{T} u_{\alpha t} A_t$, where $u_{\alpha t}$
is the $t$-th element of the eigenvector corresponding to the $\alpha$-th
eigenvalue of $S$, in decreasing order.

Following Balbi (1995), a non-symmetrical correspondence analysis (NSCA, in
the following) is performed on the $c$ matrices $C_\alpha$ ($\alpha = 1,
\ldots, c$). The choice of a proper $c$ (i.e. the number of principal matrices
considered) is determined, like the number of components in principal component
analysis, by using classical criteria, such as the scree-test, or by looking at
the percentage of the trace of $S$ accounted for by its first eigenvalues.
When, as in the present case, the variable defining the third way is time, the
decomposition of the total variability into principal components of decreasing
importance can be read using time series language (D'Alessio, 1989). Thus the
first matrix explains the main structure of the analysed phenomenon, while the
second one explains a cyclic component, then a seasonal one, and so on.

In the last step, the so-called intrastructure step, the original elements
(words and categories of the sociodemographic variables), observed on the $T$
different occasions, are represented on the NSCA first factorial plane of each
$C_\alpha$, being projected as supplementary points (Lebart, Morineau, Piron,
1995), and trajectories for homologous elements, observed at the $T$ times, can
be drawn.

The previous analysis has to be performed on matrices having the same
structure. Grouping individuals according to some time-invariant characteristics
overcomes the problem for columns. A second point has to be considered for
rows: how to deal with different vocabularies, in order to build a common
corpus from texts belonging to the different waves of the repeated survey. So
far we have spoken of words; now we have to choose the units for the textual
data analysis. Thus we have to set $w_t = w$, $\forall t \in \{1, \ldots, T\}$,
i.e. we have to build and consolidate the vocabulary we want to analyse.

Unlike one-time textual data analysis, in which this step is not always
necessary and is preferably not carried out too early, with a multiway textual
structure we need to perform a consolidation procedure in the preliminary step
of the analysis.



4. Building a Common Vocabulary 

In textual data analysis, the first question to be solved consists in choosing
the statistical units. Vocabularies are generally built on the basis of
grammatical rules (e.g. nouns are considered in their singular form) which can
differ between languages. Such a standardisation of forms means a heavy
recoding of the data, and can destroy some interesting information. Lebart
(1981) suggests the use of graphical forms (i.e. series of non-delimiting
characters, g.f.'s in the following): they are easier to store and are not
influenced by recoding (Lebart, Salem, Berry, 1991). Bolasco (1994) proposes
the use of textual forms, defined as a mixture of pure g.f.'s, lexemes derived
from g.f.'s, and segments (i.e. consecutive g.f.'s), in order both to eliminate
any ambiguities and to preserve the semantic variability in the analysed text,
as well as to merge all the elements which are semantically invariant. In other
words, the contextual aspects have to be taken into account.

When dealing with data observed at different times, the vocabulary has to be
properly consolidated by declaring the equivalences among g.f.'s which
correspond to one word in the further analysis. This pre-treatment requires
extreme attention, because equivalences valid in one period can be misleading
in another. Conversely, if the data are to be used as a whole, some
equivalences can be imposed in order to preserve the comparability among
themes observed on different occasions.

All this means that a vocabulary common to all the different waves of the
survey has to be built. As pointed out in a previous paper (Balbi, Esposito,
1997), sometimes this may seem artificial, and it may happen that interesting
forms, not present on all the occasions, have to disappear. The preservation of
the textual forms in order to compare the behaviours of the groups/words on the
different occasions on the same factorial plane seems to justify this choice.

In doing that, tools proposed in one-time analysis (e.g., the so-called
iso-frequency criterion proposed by Bolasco, 1993, or the bootstrap convex hull
co-ordinates proposed in Balbi, 1995) seem to be very useful.

The iso-frequency criterion is interesting for nouns (and adjectives, in
languages in which, e.g., the masculine form differs from the feminine one) in
order to decide whether we have to distinguish singular and plural forms: if
the two forms appear an almost equal number of times, they can be legitimately
merged.
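
A minimal sketch of such a merging rule is given below. The criterion is stated only qualitatively in the text, so the relative `tolerance` threshold, the function names and the toy counts are all assumptions introduced for illustration.

```python
def iso_frequent(count_a, count_b, tolerance=0.2):
    """True when two forms occur an almost equal number of times.

    tolerance is a hypothetical relative threshold: the forms are treated as
    iso-frequent when their counts differ by at most this share of the larger.
    """
    larger = max(count_a, count_b)
    if larger == 0:
        return False  # neither form occurs: nothing to merge
    return abs(count_a - count_b) / larger <= tolerance


def merge_if_iso_frequent(counts, form_a, form_b, label, tolerance=0.2):
    """Return a new count dict, with form_a and form_b merged under label
    only when they are iso-frequent; otherwise the counts are returned as they are."""
    a, b = counts.get(form_a, 0), counts.get(form_b, 0)
    if not iso_frequent(a, b, tolerance):
        return dict(counts)
    merged = {f: c for f, c in counts.items() if f not in (form_a, form_b)}
    merged[label] = merged.get(label, 0) + a + b
    return merged


# Hypothetical frequencies of singular and plural graphical forms.
counts = {"telefono": 40, "telefoni": 36, "servizio": 50, "servizi": 5}
step_1 = merge_if_iso_frequent(counts, "telefono", "telefoni", "telefono*")
step_2 = merge_if_iso_frequent(step_1, "servizio", "servizi", "servizio*")
print(step_2)
```

In this toy run the first pair is merged, while the second pair is kept distinct because one form is far more frequent than the other.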

The use of bootstrap convex hulls can be extremely useful whenever we want to
evaluate whether different graphical forms of the same morpheme actually have a
different representation on factorial planes, and analogously for synonyms.
Fig. 1 shows an example dealing with verbs, taken from the analysis of
advertisements for computers in 1995 in Italian political magazines. The
problem of considering only the infinitive form for the verb avere (to have)
was solved by observing the different use of the first person plural form by
the leader IBM and of the second person by the other producers.

Figure 1: Bootstrap convex hulls for factorial coordinates: computer
advertisements in the Italian press in 1995




Dealing with T matrices, convex hulls can be obtained by the analysis of 
compromise. The projections on its first factorial plane of words belonging to 
different times, together with the form and the extension of the convex hulls, 
can give useful suggestions in building textual forms through time. 



5. The Cellular Telephone on Italian Press: From Elite to Mass 

In the Eighties, in Italy, the cellular telephone was the symbol of success, a
tool for busy, rich and powerful men. The beginning of the Nineties and the end
of yuppism meant that the cellular telephone market had to change deeply. The
old image of the product had to be destroyed: the cellular telephone should
become an easy tool for young people, and for women, to be used for chatting
and not for business. Thus, it is interesting to understand how the
advertisement campaigns which appeared in the Italian press have participated
in the process of transformation from an elite market to a mass one.
Additionally, it is worth noting the influence of the new Italian economic
policy, characterised by the privatisation of some services, such as the
telephone, and the end of the State monopoly. In 1995, the market of cellular
telephone connections became a duopoly, private (OMNITEL) against public (TIM,
formerly SIP), both using the same physical network. Furthermore, technological
developments are quickly changing the characteristics of the product. All these
aspects make the study of the evolution of the language in this market
especially interesting.

In order to understand the evolution of the language with respect to the new
target, we have considered the advertisements by taking into account the kind of
press where they appeared (popular or specialised, i.e. political or technical
magazines) and the style of presentation, i.e. informative (technical) or
imaginative (descriptive). The dependent variables are represented by words.
The three occasions are represented by the three years (1993, 1996, 1997). The
explanatory categories are the time-invariant features of the messages.

Recall that the words should be the same on all occasions; thus the first
problem in the analysis consists in building the three matrices $A_{93}$,
$A_{96}$ and $A_{97}$. A preliminary recoding has been adopted, mainly using
lexemes instead of graphical forms. In some cases, the root has been taken into
account (e.g. innovazione and innovativo/a/e/i, in order to take into account
the theme of technological innovation). In other cases, when a graphical form
is very frequent (e.g. telefono or telefono-cellulare), we have referred to the
graphical form. Additionally, topics like services for calling abroad have been
merged under the label Europa, and the references to the name of each company
under the label marca. Words shorter than three characters have been ignored.
When a g.f. was in two of the three matrices, with high frequencies, it was
introduced in the third one as a row of 0's.
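
The zero-row rule can be sketched as follows. The sketch generalises it to "present in all occasions but one" and reads "high frequencies" as a row total of at least `min_count`; that threshold, the function name and the toy counts are assumptions, not values taken from the study.

```python
def align_vocabularies(tables, n_categories, min_count=10):
    """Give every occasion the same set of forms.

    tables -- dict: occasion -> dict: form -> list of counts (one per category)
    A form missing from exactly one occasion, and frequent wherever it occurs,
    is added to that occasion as a row of zeros. Forms still missing somewhere
    afterwards are dropped, so that all tables share the same rows.
    """
    occasions = list(tables)
    all_forms = set().union(*(tables[t].keys() for t in occasions))
    aligned = {t: dict(tables[t]) for t in occasions}
    for form in all_forms:
        present = [t for t in occasions if form in tables[t]]
        missing = [t for t in occasions if form not in tables[t]]
        frequent = all(sum(tables[t][form]) >= min_count for t in present)
        if len(missing) == 1 and frequent:
            aligned[missing[0]][form] = [0] * n_categories
    common = set.intersection(*(set(aligned[t]) for t in occasions))
    return {t: {f: aligned[t][f] for f in sorted(common)} for t in occasions}


# Hypothetical counts over two explanatory categories.
raw = {
    1993: {"telefono": [12, 8], "prezzo": [6, 9], "elite": [4, 1]},
    1996: {"telefono": [20, 15], "prezzo": [10, 12], "gsm": [7, 9]},
    1997: {"telefono": [25, 18], "gsm": [11, 14]},
}
aligned = align_vocabularies(raw, n_categories=2)
print(sorted(aligned[1997]))
```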

The analysis of the interstructure (Fig. 2) shows a strong opposition between
the campaigns in 1993 on the left and 1996/1997 on the right, suggesting the
use of one compromise ($c = 1$).

Figure 2: Interstructure first factorial plane 




The NSCA first factorial plane (explaining 80% of the dependence of the
vocabulary on the explanatory categories) shows a first axis (46%) characterised
by the language used in the popular press, and a second one (34%) characterised
by the language used in the specialised magazines (Fig. 3). It is interesting
to underline (Fig. 4) the opposition of telefonino on the right (i.e. more
technical advertisements) and telefono on the left. The advertisements using
rhetoric and/or iconic tools ("descriptive") use ordinary references:
conversazione (talk), fare (to do), più (more), but they are also characterised
by the more technical words, i.e. GSM and stand-by. On the contrary, the more
informative messages pay more attention to price (lire, -mila). For specialised
magazines too, the conditions of payment appear in technical advertisements,
while telefono cellulare and telefonino, near to semplice, are on the "more
descriptive" side.

Figure 3: Compromise first factorial plane: the explanatory categories 




Figure 4: Compromise first factorial plane: words 



The last step, the intrastructure, makes it possible to compare the changes in
the coordinates of some words through time. Fig. 5 shows some examples.

Figure 5: Trajectories for words: e.g. "marca" and "telefonino"











References 

Balbi S. (1995). Non symmetrical correspondence analysis of textual data and 
confidence regions for graphical forms. In: Bolasco S., Lebart L., Salem 
A. (eds.). JADT 1995, Roma: CISU, 2, 5-12. 

Balbi S., Esposito V. (1997). Comparing advertising campaigns by means of 
textual data analysis with external information. In: JADT’98. Nice (to 
appear). 

Balbi S., Lauro N. C., Scepi G. (1994). A multiway data analysis technique for 
comparing surveys. Methodologica, 3, 79-90. 

Bolasco S. (1993). Choix de lemmatisation en vue de reconstructions
syntagmatiques du texte par l'analyse de correspondance. In: Anastex S.
J. (ed.), JADT'93. Paris: Telecom.

Bolasco S. (1994). L’individuazione di forme testuali per lo studio statistico dei 
testi con tecniche di analisi multidimensionale. In: Atti della XXXVII 
riunione scientifica della S.I.S., Roma: CISU, 2, 95-103. 

D'Alessio G. (1989). Multistep principal component analysis in the study of 
panel data. In: R. Coppi & S. Bolasco (eds.) Multiway data analysis. 
Amsterdam: North-Holland, 375-381. 

Escoufier Y. (1987). Three-mode data analysis: the STATIS method. In: B.
Fichet & N.C. Lauro (eds.) Methods for Multidimensional Data
Analysis, ECAS, 259-272.

Glaçon F. (1981). Analyse conjointe de plusieurs matrices. Thèse de 3ème
cycle, Grenoble.

Lauro N.C., D'Ambra L. (1984). L'analyse non symétrique des correspondances.
In: E. Diday et al. (eds.). Data Analysis and Informatics, III, Amsterdam:
North-Holland, 433-446.

Lebart L. (1981). Vers l'analyse automatique des textes: le traitement des
réponses libres aux questions ouvertes d'une enquête. In: J. P. Benzécri
& collaborateurs, Pratique de l'Analyse des Données. Linguistique &
Lexicologie. Paris: DUNOD, 414-419.

Lebart L., Morineau A., Piron M. (1995). Statistique exploratoire
multidimensionnelle. Paris: DUNOD.

Lebart L., Salem A. (1994). Statistique textuelle. Paris: DUNOD. 

Lebart L., Salem A., Berry L. (1991). Recent developments in the statistical 
processing of textual data. Applied Stochastic Models and Data 
Analysis, 1, 47-62. 

Rizzi A. (1989). On the synthesis of three-way data matrices. In: R. Coppi & S. 
Bolasco (eds.) Multiway data analysis. Amsterdam: North-Holland, 
143-154. 

Robert P., Escoufier Y. (1976). A unifying tool for linear multivariate methods:
the RV coefficient. Applied Statistics, 25, 257-265.



Acknowledgement: The present paper is supported by a CNR grant.




Three-way textual data analysis 



Monica Becue-Bertaut 

Universitat Politecnica de Catalunya. 

Pau Gargallo, 5 - 08028 Barcelona. Spain, e-mail : monica@eio.upc.es 



Abstract: In this work, we present methods, stemming from particular
three-way data analysis methods, to tackle two specific problems arising when
analysing open-ended answers in surveys. The first problem consists in the
comparison of the answers given to the same question in different surveys; the
second one is the study of the set of categories or individuals as described by
their answers to several open questions. Specific software routines have been
devised.



1. The basic problem

Multidimensional descriptive statistical methods such as principal axes
methods (principal component analysis and correspondence analysis) and cluster
analysis methods constitute a tool for analyzing responses to one single
question in surveys (see Lebart and Salem, 1997).

The answers to closed-end questions lead to a classical $(n, p)$ matrix $X$. To
analyse open questions, the starting point is to consider the answers to each
open question as represented by an $(n, m)$ matrix $Y$ whose $i$-th row
contains the vector whose $m$ components are the frequencies of the $m$ words
in the $i$-th individual answer. $Y$ is called the lexical table. When
individuals are grouped according to a socioeconomic characteristic, the rows
are the categories of individuals, and $Y$ is called the aggregated lexical
table. Usually, only the most frequent words are kept.
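
The construction of the two tables can be sketched in a few lines. This is an illustration only: splitting answers on whitespace, the frequency threshold `min_frequency`, the function names and the toy answers are assumptions, not the routines mentioned in the abstract.

```python
from collections import Counter


def build_vocabulary(answers, min_frequency=2):
    """Keep the words whose total frequency over all answers reaches min_frequency."""
    totals = Counter(word for answer in answers for word in answer.lower().split())
    return sorted(word for word, n in totals.items() if n >= min_frequency)


def lexical_table(answers, vocabulary):
    """Y[i][j] = frequency of vocabulary[j] in the i-th answer."""
    table = []
    for answer in answers:
        counts = Counter(answer.lower().split())
        table.append([counts[word] for word in vocabulary])
    return table


def aggregated_lexical_table(Y, categories, category_order):
    """Sum the rows of Y over the individuals belonging to each category."""
    index = {c: k for k, c in enumerate(category_order)}
    agg = [[0] * len(Y[0]) for _ in category_order]
    for row, c in zip(Y, categories):
        for j, n in enumerate(row):
            agg[index[c]][j] += n
    return agg


# Hypothetical answers and age categories.
answers = ["money and work", "work work health", "health", "money"]
ages = ["<25", "<25", "65+", "65+"]

vocab = build_vocabulary(answers)
Y = lexical_table(answers, vocab)
Y_agg = aggregated_lexical_table(Y, ages, ["<25", "65+"])
print(vocab, Y, Y_agg)
```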

When several open questions are present, simultaneous analysis of the
responses requires specific tools allowing for systematic comparisons. A
similar problem arises from the repetition of the same question in different
surveys. We propose here some methods stemming from three-way data
analysis.

Different aspects can be studied: the relationships between the structures
induced by the questions; the links between words used to answer different
questions; the detection of individuals or categories of individuals with
similar responses to the whole set of questions or only to some of them; the
evolution of the responses relating to the same question, etc.

From now on, we will focus on two main problems: 1) the comparison of the
answers given by different samples to the same question; 2) the study of the set
of categories (or individuals) as described by their answers to several open
questions.



2. Comparative analysis of the answers given by different surveys

2.1 Example 

As an example, we will use a corpus of responses to a permanent survey,
carried out twice a year using independent samples (Lebart, 1987). This survey
was conducted by means of both closed-end and open questions. We will process
the answers to the question "What are the reasons that might cause a couple or
a woman to hesitate having children?" as given by two annual surveys (1986 and
1988) of 2000 individuals. As grouping criterion, we use the age of the
respondent, split into five levels (less than 25, 25 to 34, 35 to 49, 50 to 64,
65 years and over).

2.2 Methodology 
2.2.1 Data tables 

When the same question is asked at different periods, in different regions or,
more generally, of different subpopulations, the sets of words or
"vocabularies" used by the corresponding samples can be considered similar. So
we have to deal with a sequence of two-way tables $Y_t$, defined by the same
pair of variables (lexical variable and socioeconomic variable) and indexed by
survey number $t$, $t = 1, \ldots, T$ (figure 1).



Figure 1 : Sequence of tables 




At this stage we find ourselves within the scope of three-way data methods. For
a synthesis of these methods see, for example, Carlier et al. (1988) and Kiers
(1989). Escofier (1987), Escofier and Pages (1993) and Abdessemed and
Escofier (1996) present tools to compare data tables. With regard to the
evolution of contingency tables, Carlier and Ewing (1992), Carlier and
Kroonenberg (1996) and van der Heijden (1987) can be consulted.

2.2.2. Corpus segmentation and tables building 

Initial information is obtained by identifying and counting the words used.
Their frequencies in each survey are compared, and the significant differences
recorded. In the example, very few words show a significant frequency
difference between samples. We can note that the frequency of the words
corresponding to refusals (expressed using the expressions "I don't know",
"I don't get you", or "no idea") falls from the first survey to the second.

In a second step, the answers are grouped according to the values of a 
categorical variable, and the aggregated lexical tables corresponding to each 
sample are built up, keeping only the most frequent words. The main goal is to 
highlight the common structure of these tables, and also to study the deviations 
of each table from this common structure. 

To achieve this goal, the tables $Y_t$ are juxtaposed into a two-way table,
$Y_J$. There are two options, depending on which dimension, vocabulary or
categorical variable, is chosen as the common dimension. The choice depends on
the initial problem and also on the nature of the textual data. In general, we
want to compare the structures of both sets of categories, as induced by the
words selected from a common vocabulary. If most of the words are used with
similar frequencies in the different surveys, it is legitimate to assume that a
common vocabulary really exists, as is the case in our example.

This implies that the vocabulary constitutes the common dimension, and the
categories are repeated as many times as there are samples. The general term of
the juxtaposed table is denoted $f_{ijt}$; $f_{ijt}$ denotes the relative
frequency of word $j$ in category $i$ and survey $t$.

Another table is also built: the sum-table $Y_S$. This table is obtained by
adding up the frequencies of the words corresponding to the same category,
independently of the sample.
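
Both tables can be sketched as below, assuming each survey table has categories as rows and the words of a common vocabulary as columns. The function names and the toy counts are hypothetical.

```python
def juxtapose(tables):
    """Stack T category-by-word tables over a common vocabulary.

    Returns (labels, rows): labels[k] = (i, t) identifies category i of
    survey t, and rows[k] holds the corresponding word counts.
    """
    labels, rows = [], []
    for t, table in enumerate(tables):
        for i, row in enumerate(table):
            labels.append((i, t))
            rows.append(list(row))
    return labels, rows


def relative_frequencies(rows):
    """Divide every count by the grand total, giving the terms f_ijt."""
    total = sum(sum(r) for r in rows)
    return [[n / total for n in r] for r in rows]


def sum_table(tables):
    """Add up, category by category, the word counts of all surveys."""
    n_cat, n_words = len(tables[0]), len(tables[0][0])
    return [[sum(t[i][j] for t in tables) for j in range(n_words)]
            for i in range(n_cat)]


# Two hypothetical surveys, 2 categories x 2 words each.
Y_1 = [[4, 1], [2, 3]]
Y_2 = [[3, 2], [1, 4]]

labels, Y_J = juxtapose([Y_1, Y_2])
F = relative_frequencies(Y_J)
Y_S = sum_table([Y_1, Y_2])
print(labels, Y_J, Y_S)
```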

2.2.3 Analysis of the tables 

First step: each table $Y_t$, and also the sum-table $Y_S$, is visualized
through correspondence analysis. This makes it possible to pinpoint the
principal features of the data, and to detect a possible common structure in
the different tables $Y_t$. The analysis of the sum-table $Y_S$ is the analysis
of the mean profiles of the categories as computed over all samples. Projecting
the category columns of all tables $Y_t$ as supplementary columns upon the
principal axes resulting from the analysis of the sum-table provides us with a
visualization, on the same map, of the mean structure of the categories, and
also of the category structures corresponding to each survey.

In the example, using age as grouping criterion, the three category structures
(corresponding to the sum-table and to each survey) are similar; the age level
categories have ordered coordinates on the principal axis, and the second axis
opposes extreme categories (less than 25, 25 to 34 and 65 years and over) to
intermediate categories (35 to 49 and 50 to 64 years). We can state that there
is a quite regular evolution of the vocabulary with respect to age.

Second step: the second step consists in analysing the juxtaposed table $Y_J$
through correspondence analysis.



Figure 2: Correspondence analysis of the juxtaposed table. Principal plane.




Figure 2 shows the principal plane obtained in this analysis. The initial
letters A and B identify categories as belonging to the first or the second
survey, respectively. The results depend on the relative importance of the
variance of the categories within each survey (within-surveys variance) and of
the variance of the survey centroids (between-surveys variance) (table 1).



Table 1: Within-surveys and between-surveys variance

Variance           space     Axis 1          Axis 2
space              0.0856    0.0285          0.0131
within-surveys     0.0689                    0.0054 (41.5%)
between-surveys    0.0167    0.0080 (28%)    0.0077 (58.5%)

Another important interpretation aid consists in the correlations between the
principal components of each pair of tables $(Y_J, Y_t)$, $t = 1, \ldots, T$,
to detect common and specific axes. In the example, the first principal
component of the juxtaposed analysis is highly correlated with the respective
principal component of each survey analysis.

The first axis is, above all, an axis describing the variance of the categories
within each survey: it is an "age axis". It opposes the words used by young
categories to those used by old ones.

The second axis depends on both variances, although slightly more on the
between-surveys variance. The following axes depend, above all, on the
within-surveys variance. We can note that both category structures indicate a
quite parallel evolution of the vocabulary with age in both surveys.

Third step: another analysis helps to visualize the differences existing in
the correspondence between socio-economic variable and vocabulary according to
the survey. The goal of this analysis is to provide us with a visualization of
the internal structure of each survey, representing all of these structures in
the same compromise reference space (figure 3).

The basic idea is the following: denoting by $N_t$ the subcloud of category
profiles corresponding to the same survey $t$, each subcloud $N_t$ is
translated so that its centroid coincides with the global centroid. The whole
cloud, resulting from all these subclouds, is analysed using the same
chi-square metric as in step 2. This is equivalent to analysing the juxtaposed
table by correspondence analysis, but relative to a particular model (and not
relative to the global independence in accordance with the juxtaposed table
margins) (Escofier, Pages, 1993). This model corresponds to the independence
between categories and vocabulary within each table $Y_t$. It can be shown that
this is equivalent to processing by classical correspondence analysis the table
$Y_C$ whose general entry is:

$f_{ijt} - \frac{f_{i \cdot t} \, f_{\cdot jt}}{f_{\cdot \cdot t}} + f_{i \cdot t} \, f_{\cdot j \cdot}$
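
The translation of each subcloud onto the global centroid can be written entry by entry. The sketch below is an illustration under stated assumptions, not the author's routine: rows are categories, columns are the words of a common vocabulary, the helper name `centred_table` is hypothetical, and the entry is taken to be $f_{ijt} - f_{i \cdot t} f_{\cdot jt} / f_{\cdot \cdot t} + f_{i \cdot t} f_{\cdot j \cdot}$, which is the form that moves the centroid of each survey onto the global one while leaving the row margins unchanged.

```python
def centred_table(tables):
    """Stack T category-by-word count tables, centring profiles within each survey.

    Entry for category i, word j, survey t (all f's are relative frequencies):
        f_ijt - f_i.t * f_.jt / f_..t + f_i.t * f_.j.
    """
    grand = sum(n for table in tables for row in table for n in row)
    freq = [[[n / grand for n in row] for row in table] for table in tables]
    n_words = len(freq[0][0])
    # Global column margin f_.j. (profile of the global centroid).
    col_global = [sum(row[j] for table in freq for row in table)
                  for j in range(n_words)]
    out = []
    for table in freq:
        total_t = sum(sum(row) for row in table)                      # f_..t
        col_t = [sum(row[j] for row in table) for j in range(n_words)]  # f_.jt
        for row in table:
            r = sum(row)                                              # f_i.t
            out.append([row[j] - r * col_t[j] / total_t + r * col_global[j]
                        for j in range(n_words)])
    return out


# Two hypothetical surveys, 2 categories x 2 words each.
tables = [[[4, 1], [2, 3]],
          [[6, 2], [1, 1]]]
Y_C = centred_table(tables)

# Centroid (mean word profile) of the first survey after centring.
block = Y_C[0:2]
block_total = sum(sum(r) for r in block)
centroid_1 = [sum(r[j] for r in block) / block_total for j in range(2)]
print(centroid_1)
```

In this toy run the centroid of the first survey equals the global word profile, here (13/20, 7/20), which is the property the translation is meant to enforce.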



On the principal axes corresponding to this last analysis, it is useful to
represent the categories using the projection of the initial category rows
$(i, t)$ as supplementary elements (Bry, 1996); in this way each category is
located, up to a constant, at the centroid of the words, each word being given
a relative weight corresponding to its importance in the responses of this
category. This recovers a usual result of two-way correspondence analysis,
which is otherwise lost in this analysis for the category rows. This analysis
offers an optimal visualization of both internal structures, in the sense that,
in the projection, the maximum internal variance is kept, taking into account
both surveys.

Figure 3 shows the first principal plane resulting from this method applied to
the example; the categories are represented by the initial rows as
supplementary rows, as justified above.

To interpret the results, the relative importance of the within-surveys and
between-surveys variances is computed, and the axes common to different surveys
are detected, as well as those which are specific to one of them.



Figure 3: Principal plane of the analysis of the table $Y_C$
(centring the category profiles within each survey)




3. Description of individuals or individual categories using the 
answers to various open questions 

When the same individuals answer several open questions, it is interesting to
study their responses simultaneously. In this case, the individuals are
described by several sets of frequency variables.

3.1 Data tables 

There are as many "vocabularies" as questions. Each category/individual is
described by the frequencies associated with the words within each vocabulary.
A table $Y_q$ is associated with each open question. These tables are
juxtaposed, so that the rows represent the categories/individuals and the
columns are the words corresponding to the various questions $q$,
$q = 1, \ldots, Q$.

3.2 Methodology

In a first step, each table $Y_q$ is analysed by correspondence analysis (in
this case, the sum-table is meaningless). In a second step, the correspondence
analysis of the juxtaposed tables $Y_q$ (figure 4), but centring the word
profiles within each survey, gives us a tool to compare them using a common
reference space. As shown in Section 2, it allows one to detect common and
specific axes in the different tables $Y_q$.

Figure 4: Juxtaposition of tables (rows: individuals; column blocks:
vocabulary 1, vocabulary 2, vocabulary 3)



Each category/individual appears once, located at the centroid of the words
used in the different answers. Although word sets are different for each
question, some words (at least the usual function words) will be repeated. In this
case, they will appear as many times as there are questions. This provides us with
a visualization of the interrelations between words, showing which words are
frequently present in the different answers of the same category/individual.
Comparing the different positions of the same word in the different questions
will bring information about its role as an element of the discourse.

To compare the structures of the categories according to the different
questions, it is then possible to use the initial tables Yq as supplementary rows.
This provides us with a representation which superimposes the description of the
categories/individuals as given by each question, and the overall description.



4. Weighting the different tables 

A problem arises when a table plays a predominant role and, by itself,
determines the first principal axis. When this occurs, it is necessary to adopt
a multiple factor analysis point of view and to reweight the tables Yt (or Yq) to
balance their influence on the construction of the first axes. However, we deal with
two-way tables with different row margins, so a compromise weighting must be
chosen for the rows.
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5. Other issues 

It is possible to tag the corpus units using a morpho-syntactic analyzer, and to
count the frequencies of the syntactic categories in the answers (Salem, 1995).
The individuals are then described by the words on the one hand, and by the
syntactic categories they use, on the other. A systematic comparison of the 
structures induced by both sets will provide information about discourse 
features. The tools here presented offer the possibility to study and to compare 
how the individuals are described by the content of their responses and, 
simultaneously, by the syntactic features of these responses. 
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Abstract: The specific complexity of textual data sets (free answers in surveys, 
documentary data bases, etc.) is emphasized. Recent trends of research show 
that classification techniques (discrimination and unsupervised clustering as 
well) are widely used and have great potential in both Information Retrieval 
and Text Mining. 
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1. The scope of multivariate analysis of texts 

The amount of information available only in unstructured textual form is 
rapidly increasing. Classification methods play a major role in the 
computerized exploration of such corpora. They can contribute to process 
textual data sets in the three following main domains: 

1.1 Producing visualizations and/or groupings of elements (free responses in
marketing and socioeconomic surveys, discourses, scientific abstracts, patents, 
broadcast news, financial and economic reports, literary texts, etc.); looking for 
associations and patterns (exploratory context of Text Mining, whose ultimate 
aim is to extract knowledge from large bodies of unstructured textual data). 

1.2 Devising decision aids for attributing a text to an author or a period; 
choosing a document within a database; coding information expressed in 
natural language. 

1.3 Helping to achieve more technical or upstream contributions, such as 
lexical disambiguation, parsing, selection of statistical units, description of 
semantic graphs, speech and optical character recognition. 

Note that cluster analysis has been involved in these applications ever since the 
beginning of these investigations (see: Jardine and van Rijsbergen, 1971; 
Willet, 1988). 
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2. Complexity and specific features of textual data 

2.1 The concepts of variables and observations are more complex than those 
usually dealt with in most statistical applications. Variables, instead of being 
defined a priori, are derived from the text. Examples of (categorical) variables 
are the following text units: words, lemmas, segments (sequences of words 
appearing with a certain frequency). In the following, we use the term word to 
designate the textual unit under consideration. In lexicometry terminology,
word is a synonym of type, as opposed to token, which designates a particular
occurrence of a type. 

Statistical units (or: observations, subjects, individuals, examples) are generally 
documents (described by their titles or abstracts) in documentary databases, 
respondents (described by their responses to open questions) in surveys, or 
segments of texts (sentences, context units, paragraphs) in literary applications. 
Besides the respondents, a second level of statistical units is constituted by the 
occurrences of words. Some statistical tests may involve counts of occurrences, 
whereas others deal with counts of documents or respondents. Such duality is 
often a source of difficulty and misunderstanding. 

2.2 Two additional characteristics increase the complexity of the basic data 
tables: These tables are large (thousands of documents, thousands of words), 
often sparse (a document contains a relatively small number of words). 

2.3 But the main feature of textual data sets is certainly the enormous amount 
of available meta-data. Every word is allocated several rows in a dictionary. To 
identify without ambiguity the lemma associated with a word may often require 
the help of a sophisticated syntactic analyzer. Rules of grammar, semantic 
networks, obviously constitute basic meta-information. 

2.4 Finally, let us recall that we are dealing with sequences of occurrences
(or: strings) of items, whose order could be of importance, another non standard 
situation in multidimensional data analysis. Data analysts are accustomed to 
dealing with rectangular arrays of nominal, ordinal, or numerical variables. In 
textual data analysis, the basic data cannot be reduced to such an array. Let us 
consider the case of the responses of n individuals to an open question. 

If $\{s_1, s_2, \ldots, s_v\}$ designates a set of v different elements (the vocabulary, i.e.
the set of v different words, in the present case), an individual i ($i \le n$) will be
characterized by an ordered sequence, with variable length $\gamma(i)$:
$\{s_{r(i,1)}, s_{r(i,2)}, \ldots, s_{r(i,\gamma(i))}\}$, where $1 \le r(i,k) \le v$ and $1 \le k \le \gamma(i)$.

Note that a word can appear several times in a sequence. r(i,k) is thus the index
of the k-th word in the response of individual i. The first task of any
classification method is then to compute similarities between such sequences 
(with variable length) of ordered items with repetition (see below section 4.1). 







3. Clustering of observations and words 

3.1 Observations (responses, documents) 

The starting point is to consider each observation as described by its lexical 
profile, i.e. by a vector that contains the frequency of all the selected units in 
the text (these units could be words, at the outset). In many cases, a textual data 
set can lead to building a (n,v) contingency table X whose general entry (i,j) is 
the number of occurrences of word j in the text (observation) i. X can be easily 
derived from the non-rectangular array whose general entry is r(i,k) (as defined 
in section 2.4), but the converse is not true: the information relative to the order 
of the words within a response is lost in X. In most applications, the array r(i,k) 
is actually much more compact than X; thus, a response i containing $\gamma(i) = 20$
occurrences out of a vocabulary of 2000 words corresponds to a row r(i,k) of 
length 20, and to a row x(i,j) of length 2000. A clustering algorithm involving 
computations of distances directly from the data table, such as the k-means 
method, can easily be reformulated using r(i,k) instead of x(i,j) in order to take 
advantage of the sparsity of X. 
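
As an illustration only, the passage from the array r(i,k) to the sparse rows of X, and a distance computed directly on that sparse form, can be sketched in Python as follows. The function names and the toy responses are assumptions made for the example; they do not come from the original study.

```python
from collections import Counter


def lexical_profiles(sequences):
    """Turn each response (a list of word indices r(i, k)) into a sparse row of X."""
    return [Counter(seq) for seq in sequences]


def squared_distance(row_a, row_b):
    """Squared Euclidean distance between two sparse rows.

    Only the words present in at least one of the two responses are visited,
    so the cost depends on the response lengths, not on the vocabulary size v.
    """
    keys = set(row_a) | set(row_b)
    return sum((row_a.get(j, 0) - row_b.get(j, 0)) ** 2 for j in keys)


# Hypothetical toy responses: word indices in their order of occurrence.
responses = [[3, 7, 3, 12], [7, 12, 12], [5]]
X = lexical_profiles(responses)
print(squared_distance(X[0], X[1]))
```

Note that the Counter rows discard the order of the words, exactly as X does; the ordered lists in `responses` remain the only place where that information survives.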

Note that from a given corpus, many different contingency tables can be built, 
according to various thresholds of frequencies for the words. 

However, a usual classification algorithm applied to the rows of X could lead
to poor or misleading results. As mentioned above, the matrix X could be very
sparse, and many pairs of rows could have no element at all in common (the
computational advantage of sparsity is then pointless). Moreover, available
meta-data need to be taken into account (syntactic relationships, semantic 
networks, external corpus and lexicons, etc.) as well as the order of occurrences 
within each response or text. Section 4 below will suggest some possible 
improvements. 

3.2 Words (columns of matrix X) 

Classification of words is rarely the final outcome of a text analysis. It is 
however an important intermediate step, allowing for the definition of new 
statistical units, thence improving the similarities between observations (see 
also: section 4). 

At the lower levels of a hierarchy, one can find textual co-occurrences within 
sentences, as well as fixed length text segments and paragraphs. The search for 
preferential associations is an important factor in applications involving natural 
language processing (see, e.g.: Lewis and Croft, 1990). It can help to solve 
some disambiguation problems useful, for example, in the recognition phase 
following optical scanning of characters. 

Note that non-symmetrical measures of local associations between words (e.g.: 
mutual information index I(x,y) resulting from the Information Theory of 
Shannon, as proposed by Church and Hanks (1990)) entail difficulties with 
classical clustering algorithms. 
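
For reference, the mutual information index of Church and Hanks compares the observed probability of a pair with the one expected under independence: I(x,y) = log2[P(x,y) / (P(x)P(y))]. The sketch below estimates it from raw counts; the function name and the counts are hypothetical. The index is non-symmetrical when, for instance, count_xy only counts the occurrences of x followed by y within a window.

```python
import math


def mutual_information(count_xy, count_x, count_y, n):
    """Estimate I(x, y) = log2(P(x, y) / (P(x) * P(y))) from counts over n tokens."""
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: x seen 16 times, y 32 times, the pair (x, y) 8 times,
# in a corpus of 1024 tokens.
print(mutual_information(8, 16, 32, 1024))
```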
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The main patterns observed in the first principal subspaces spanned by the first 
principal axes of a Singular Values Decomposition (or of a correspondence 
analysis) of matrix X are generally marked out by the higher level clusters 
produced by hierarchical clustering. 



4. Enhancing the similarities 

4.1 Taking into account the order of items 

The use of additional units, such as repeated segments (Salem, 1984), can 
partially enrich the data arrays with information about order of items within 
texts. The repeated segment approach deals with the blind and automated 
detection of repeated sequences of words within a given corpus, whether or not 
these sequences constitute frozen phrases or expressions. The principles of a 
fast algorithm able to uncover such segments are given in Lebart and Salem 
(1997). 

Direct measures of distances (such as the Levenshtein distance) have been 
specifically devised for strings (see, e.g.: Coggins, 1983). In this context, 
clustering has proved to be a crucial step when trying to extract generative 
grammars from a corpus of strings. 
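
A minimal sketch of the Levenshtein distance, written here so that it applies to sequences of word indices as well as to character strings. This is the standard dynamic-programming formulation with unit costs for insertion, deletion and substitution; it is given as an illustration, not as the variant used in the works cited above.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to the empty sequence
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Two hypothetical responses coded as sequences of word indices r(i, k).
print(levenshtein([3, 7, 3, 12], [7, 12, 12]))
```

Unlike a distance computed on the rows of X, this measure depends on the order of the items.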

4.2 Using syntactic information 

Additional variables obtained from a morpho-syntactic analyzer can be
instrumental in making the distances between lexical profiles
of observations (responses or texts) more meaningful. The main idea is to tag the words
depending on their category (nouns, verbs, prepositions, etc.), and to
complement the p-vector associated with each response or text with these new
components of a different nature (see, e.g.: Biber, 1995; Habert and Salem,
1995; Habert et al., 1997).

4.3 Semantic relationships 

The semantic information defined over the pairs of statistical units (words, 
lemmas) is summarized by a graph that can lead to a specific metric structure. 

a) The semantic graph can be constructed from an external source of
information (e.g.: a dictionary of synonyms, a thesaurus). In such a case, a
preliminary lemmatization of the text must be performed. A practical way of
taking into account the semantic neighbourhood consists in complementing the
words of a given response with their semantic neighbours (provided with
smaller weights). This leads to the transformation: $Y = X(I + \alpha M)$, where I is
the unit matrix, M the binary matrix associated with the semantic graph defined
previously, and $\alpha$ a diffusion weight calibrating the importance given to
semantic neighbourhoods. Due to this semantic contamination, Y is less sparse
than X. The induced metric defined by the matrix $Q = (I + \alpha M)^2$ leads to a new
similarity index that can be used for the classification of the subjects.
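
The effect of this transformation can be sketched on a toy example. The three-word vocabulary, the graph M and the value of the diffusion weight below are assumptions chosen for illustration; the code uses plain Python lists so as to remain self-contained.

```python
def diffuse(X, M, alpha):
    """Return Y = X (I + alpha * M) for matrices stored as lists of rows."""
    v = len(M)
    # T = I + alpha * M
    T = [[(1.0 if j == k else 0.0) + alpha * M[j][k] for k in range(v)]
         for j in range(v)]
    return [[sum(row[j] * T[j][k] for j in range(v)) for k in range(v)]
            for row in X]


def inner(a, b):
    """Inner product of two rows."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical vocabulary: 0 = "car", 1 = "automobile", 2 = "bread".
# M links the two synonyms "car" and "automobile".
M = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
X = [[1, 0, 0],   # response 1 uses "car"
     [0, 1, 0],   # response 2 uses "automobile"
     [0, 0, 1]]   # response 3 uses "bread"
Y = diffuse(X, M, alpha=0.5)
print(inner(X[0], X[1]), inner(Y[0], Y[1]))
```

In X the first two responses share no word, so their inner product is zero; in Y each of them has received a fraction of its neighbour's weight and the two rows are no longer orthogonal, whereas the third response remains unrelated to both.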
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b) The semantic graph can also be built up according to the associations 
observed in an external reference corpus, or within the corpus itself (see: Becue 
and Lebart, 1996). Descriptions of semantic relationships between words 
through self-organizing maps have been suggested by Ritter and Kohonen 
(1989). A hierarchical classification of words (characterized by their associates 
in a thesaurus), complemented with a principal axes visualization of the main 
nodes, also produces satisfactory descriptions of such huge graphs.

c) Another way of deriving a matrix M from the data themselves is to 
perform a hierarchical classification of words (described by their neighbours in 
a reference corpus) and to cut the dendrogram at a low level of the index. This
provides either a graph associated with a partition, or a more general weighted
graph if the nested set of partitions (corresponding to the values of the index
less than a fixed threshold) is taken into account.



5. Discriminant analysis and information retrieval (IR) 

In this context, there are several outside sources of information that can be 
called upon to resolve classification problems: syntactic analyzers, preliminary 
steps toward gaining an understanding of the search, dictionaries or semantic 
networks to lemmatize and eliminate ambiguities within the search, and 
possibly artificial corpora that resort to experts. The major discriminant 
analysis (DA) methods that are suited to large matrices of qualitative data are: 
DA under conditional independence, DA by direct density estimation, DA by 
the method of nearest neighbours, DA on principal coordinates, neural 
networks. 

5.1 Regularized discriminant analysis and latent semantic indexing

Regularization techniques strive to make discriminant analysis possible in cases
that statisticians deem to be "poorly posed" (hardly more individuals than 
variables) or "ill posed" (fewer individuals than variables). Correspondence 
analysis makes it possible to replace qualitative variables (presence or 
frequency of a word) with numeric variables (the values of principal 
coordinates), and thus to apply classical DA (linear or quadratic). It thus serves 
as a "bridge" between textual data (that are qualitative, and often sparse) and 
the usual methods of DA. But most important, a filtering of information is 
accomplished by dropping the last principal coordinates. This process 
strengthens the predictive power of the procedure as a whole (Wold, 1976). 
These properties are applied in Information Retrieval. For instance, Deerwester
et al. (1990) suggest, under the name of Latent Semantic Indexing (LSI), using
a preliminary filtering through singular value decomposition, which is the basis
of both correspondence analysis and principal components analysis. For a 
recent review of several filtering methods, including LSI, see: Yang (1995). 
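
The filtering step can be sketched with a plain truncated singular value decomposition. The sketch assumes that NumPy is available and that the rows of X are the observations; the toy matrix and the function name are hypothetical. A correspondence analysis would apply the same decomposition to a suitably standardized table, which is not done here.

```python
import numpy as np


def lsi_coordinates(X, k):
    """Keep the first k singular axes of X.

    Returns the coordinates of the rows on these axes and the axes themselves.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]


# Hypothetical documents x words table with two groups of documents.
X = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0],
              [0.0, 0.0, 2.0, 1.0]])
coords, axes = lsi_coordinates(X, k=2)
X_filtered = coords @ axes   # rank-2 reconstruction: the last axes are dropped
print(np.round(X_filtered, 3))
```

Any classical discriminant analysis can then be run on `coords` (2 numeric variables per document) instead of the sparse rows of X.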
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5.2 Textual units and discriminant analysis 

The nature and the quality of a discriminant analysis vary
depending upon the choice of the basic variables and the combination of
methods:

a) Words chosen can be enhanced with counts of segments. 

b) Words and/or segments can be selected with the help of a previously
established frequency threshold. 

c) The text can be lemmatized (with or without elimination of function words) 
and enriched with syntactic categories. 

d) Only the words (segments, lemmas, etc.) that characterize the groups to be 
discriminated can be selected beforehand. 

On the basis of the working vocabulary thus created it is possible to: 

e) proceed to a preliminary singular value decomposition (or to a 
correspondence analysis) of the table X (words x observations), and keep 
only the first axes (filtering and regularizing through SVD). 

f) proceed to a preliminary cluster analysis, and work on aggregates of units 
(Hearst and Pedersen, 1996). In Lebart (1992), clusters are used to take into 
account variations of density within each category. The approaches of 
Salton and McGill (1983), Iwayama and Tokunaga (1995), in the
framework of discriminant analysis (named in this context "Text
Categorization") consist of a combination of clustering and discrimination:
in a preliminary phase, clusters are built to mark out the vector space 
containing the observations, and to limit the number of comparisons of 
distances during the categorization or assignment step. 

All of these alternatives imply a strategy to be developed by the user. Different
strategies exist in the framework of learning theory for using combinations of
methods (such as stacking, bagging, boosting; for a review in the specific
context of IR, see: Hull et al., 1996).
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Abstract: In the general context of Knowledge Discovery, specific tech- 

niques, called Text Mining techniques, are necessary to extract informa- 
tion from unstructured textual data. The extracted information can then 
be used for the classification of the content of large textual bases. In this 
paper, we present two examples of information that can be automatically 
extracted from text collections: probabilistic associations of key-words and
prototypical document instances. The Natural Language Processing (NLP) 
tools necessary for such extractions are also presented. 

Key-words: text mining, knowledge discovery, natural language process- 
ing 



1. Introduction 

The general purpose of Knowledge Discovery is to "extract implicit, pre- 
viously unknown, and potentially useful information from data" (Frawley, 
1991). Due to the continuous growth of the volume of electronic data currently
available, automated knowledge extraction techniques are becoming increasingly
necessary to exploit the huge amounts of data stored in the
information systems. In addition, as the usual Data Mining techniques are 
essentially designed to operate on structured databases, specific techniques, 
called Text Mining techniques, have to be developed to process the impor- 
tant part of the available information that can be found in unstructured 
textual form. 

Text Mining (TM) therefore corresponds to the extension of the more tradi- 
tional Data Mining approach to unstructured textual data and is concerned 
with various tasks such as extraction of information implicitly contained in 
collections of documents or similarity-based structuring and visualisation 
of large sets of texts. 

In sections 2 and 3, we present two examples of information extraction using 
different TM techniques: automated key-word association extraction, and 
prototypical document mining. In section 4, a discussion on the relation 
between these TM techniques and other Textual Data Analysis techniques 
is presented. 
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2. Mining for associations 

In the case of an indexed document collection, the indexing structures (usu- 
ally key-word sets) can be used as a basis for information extraction. 

In such a framework, one possible goal is to extract significant key-word 
associations. 

Let us consider a set of key-words $A = \{w_1, w_2, \ldots, w_m\}$ and a collection of
indexed documents $T = \{t_1, t_2, \ldots, t_n\}$ (each $t_i$ is associated with a
subset of A denoted $t_i(A)$).

Let $W \subseteq A$ be a set of key-words; the set of all documents t in T such that
$W \subseteq t(A)$ will be called the covering set for W and denoted [W].

Any pair (W, w), where $W \subseteq A$ is a set of key-words and $w \in A \setminus W$, will
be called an association rule (or simply an association), and denoted
$(W \Rightarrow w)$.

Given an association rule $R : (W \Rightarrow w)$,

• $S(R, T) = |[W \cup \{w\}]|$ is called the support of R with respect to the
collection T (|X| denotes the size of the set X);

• $C(R, T) = |[W \cup \{w\}]| / |[W]|$ is called the confidence of R with respect to the
collection T.

Notice that C(R, T) is an approximation (maximum likelihood estimate)
of the conditional probability for a text of being indexed by
the key-word w if it is already indexed by the key-word set W.

An association rule R generated from a collection of texts T is said to
satisfy support and confidence constraints $\sigma$ and $\gamma$ if

$S(R, T) \ge \sigma$ and $C(R, T) \ge \gamma$.

To simplify notations, $[W \cup \{w\}]$ will often be written [Ww], and a rule
$R : (W \Rightarrow w)$ satisfying given support and confidence constraints will be
simply written as:

$W \Rightarrow w \quad S(R, T)/C(R, T)$

Informally, for an association rule $(W \Rightarrow w)$, such $\sigma/\gamma$ constraints can be
interpreted as: there exists a significant number of documents (at least $\sigma$)
for which being related to the topic characterised by the key-word set W
implies (with a conditional probability estimated by $\gamma$) being also related
to the topic characterised by the key-word w.
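
The definitions of the covering set, the support and the confidence translate directly into a few lines of Python. The sketch below is an illustration under assumptions: the four indexed documents are invented, and the function names are not those of any existing tool.

```python
def covering_set(W, collection):
    """Return [W]: the indices of the documents whose key-word set contains W."""
    W = frozenset(W)
    return {i for i, keywords in enumerate(collection) if W <= keywords}


def support(W, w, collection):
    """S(R, T) for the rule R : (W => w), i.e. the size of [W u {w}]."""
    return len(covering_set(set(W) | {w}, collection))


def confidence(W, w, collection):
    """C(R, T) for the rule R : (W => w), i.e. |[W u {w}]| / |[W]|."""
    covered = len(covering_set(W, collection))
    return support(W, w, collection) / covered if covered else 0.0


# Hypothetical indexed collection: each document is reduced to its key-word set.
docs = [frozenset(d) for d in (
    {"gold", "silver", "usa"},
    {"gold", "silver", "usa"},
    {"gold", "silver", "canada"},
    {"gold", "copper", "canada"},
)]
print(support({"gold", "silver"}, "usa", docs),
      confidence({"gold", "silver"}, "usa", docs))
```

Here three documents are indexed by both gold and silver, and two of them are also indexed by usa, so the rule has support 2 and confidence 2/3.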
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As far as the actual association extraction is concerned, the common pro- 
cedures are usually two-steps algorithms: 

• generation of all the key-word sets with support at least equal to $\sigma$
(i.e. all the key-word sets W such that $|[W]| \ge \sigma$). The generated
key-word sets are called the frequent sets (or $\sigma$-covers);

• generation of all the association rules that can be derived from the
produced frequent sets and that satisfy the confidence constraint $\gamma$.

The frequent sets are obtained by incremental algorithms that explore the
possible key-word subsets, starting from the frequent singletons (i.e. the
{w} such that $|[\{w\}]| \ge \sigma$) and iteratively adding only those key-words that
produce new frequent sets. This step is the most computationally expensive
(exponential in the worst case) in the extraction procedure.

The associations derived from a frequent set W are then obtained by generating
all the implications of the form $W \setminus \{w\} \Rightarrow w$ ($w \in W$), and keeping
only the ones satisfying the confidence constraint $\gamma$.

Some additional treatment (structural or statistical pruning, redundancy 
elimination) is usually added to the extraction procedure in order to reduce 
the number of generated associations. 

Association-based Text Mining techniques have been explored by R. Feld- 
man (Feldman, 1996) with the KDT (Knowledge Discovery in Texts) tool 
on the Reuter corpus. This corpus is a newswire collection containing about 
22,000 articles manually indexed with 135 categories from the Economics
domain by Reuters Ltd. and Carnegie Group Inc. in 1987.

For such a corpus, association extraction from the key-word sets makes it possible to
satisfy information needs expressed by queries. The result of the extraction
process is a list of associations, ordered by support and confidence. An
example of queries and associated results is given in Table 1.



Table 1 : Examples of associations

query  : 'find all associations between a set of countries including Iran
         and any person'
result : [Iran, Nicaragua, Usa] => Reagan 6/1.000

query  : 'find all associations between a set of topics including Gold
         and any country'
result : [gold, copper] => Canada 5/0.556
         [gold, silver] => USA 18/0.692
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3. Mining for prototypical documents 

Another direction of research for automated information extraction is to ap- 
ply knowledge discovery techniques to the complete textual content of the 
documents (in a so-called "full text" approach as opposed to approaches 
only considering indexing key-words). However, our experiments on the
Reuter corpus (Rajman, 1997) have shown that the extraction process does 
not produce any exploitable results when the standard association extrac- 
tion techniques are directly applied on the words contained in the docu- 
ments instead of operating on the already abstract concepts represented by 
the key-words. Among the extracted associations, some only indicate the 
presence of domain-dependent compounds ({wall} => street, {government
prime} => minister, {treasury secretary james} => baker), while others are
simply uninterpretable ({dollars shares exchange total commission stake}
=> securities, {million april management lead issues underwriting
denominations} => selling).

A different approach is therefore necessary when full text is considered: 
prototypical document extraction. A prototypical document is informally 
defined as a document corresponding to information that occurs in a
repetitive fashion in the document collection, i.e. a document representing 
a class of similar documents in the textual base. 

The extraction techniques operating in this framework still use the notion of 
frequent sets, but additional Natural Language (NL) techniques are used to 
preprocess the data, and identify more significant linguistic entities (terms)
for the frequent set extraction process. 

More precisely, the NL preprocessing, realized in collaboration with R. Feld- 
man’s team at Bar Ilan University, was decomposed into two steps: Part- 
of-Speech tagging and term extraction. 

Part-of-Speech tagging

This process automatically identifies the morpho-syntactic categories (noun,
verb, adjective, ...) of words in the documents. Such tagging makes it possible to
filter out non-significant words on the basis of their morpho-syntactic category.
In our experiments, we used a rule-based tagger designed by E. Brill (Brill,
1992), and restricted the extraction process to operate only on nouns,
adjectives and verbs.

Term extraction 

This process aims at the identification of the domain-dependent compounds. 
It allows the mining process to focus on more meaningful co-occurrences, 
and can be decomposed into: 

• term candidates identification (on the basis of structural linguistic 
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information; in our case, morpho-syntactic patterns such as Noun 
Noun, Noun of Noun,...); 

• term candidates filtering (based on statistical relevance scoring (Daille, 
1994)). 

For example, the already mentioned sequence of words (found by association
extraction on full text) (Treasury, Secretary, James, Baker) was tagged
as Treasury/Noun Secretary/Noun James/Noun Baker/Noun and subsequently
identified as one single term Treasury_Secretary_James_Baker/Noun.
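
The candidate identification step can be illustrated by a deliberately simplified sketch that handles only one pattern, a run of consecutive nouns, and omits the statistical filtering of the candidates. The tag names, the input format and the function name are assumptions; the actual preprocessing relied on richer morpho-syntactic patterns.

```python
def merge_noun_runs(tagged):
    """Collapse each run of consecutive nouns into one underscore-joined term."""
    merged, run = [], []
    for word, tag in tagged + [(None, None)]:   # the sentinel flushes the last run
        if tag == "Noun":
            run.append(word)
            continue
        if run:
            merged.append(("_".join(run), "Noun"))
            run = []
        if word is not None:
            merged.append((word, tag))
    return merged


# Hypothetical output of a Part-of-Speech tagger.
tagged = [("Treasury", "Noun"), ("Secretary", "Noun"), ("James", "Noun"),
          ("Baker", "Noun"), ("said", "Verb")]
print(merge_noun_runs(tagged))
```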

Mining process 

On the basis of the terms resulting from the NL preprocessing step, an 
algorithm similar to the one described in the previous section was used to 
extract frequent term sets from the document collection. 

The extracted frequent sets were then submitted to several additional treat- 
ments in order to determine the prototypical documents: 

• To reduce information redundancy, a clustering of the frequent term 
sets was performed, on the basis of a similarity measure derived from 
the number of common terms in the sets. The resulting clusters were 
represented by the union of their constitutive term sets. 

• To limit the possible meaning shifts due to variations in the word
ordering, the clusters were further split into sets of term sequences
associated with paragraph boundaries in the original documents.
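
The first of these treatments can be sketched as follows. The text only states that the similarity is derived from the number of common terms; the sketch assumes a Jaccard-type ratio, a single greedy pass and an arbitrary threshold, all of which are hypothetical choices made for illustration.

```python
def cluster_term_sets(term_sets, min_overlap):
    """Greedy single-pass clustering; each cluster is the union of its term sets."""
    clusters = []
    for terms in term_sets:
        terms = set(terms)
        for cluster in clusters:
            # Similarity derived from the number of common terms (Jaccard ratio).
            if len(terms & cluster) / len(terms | cluster) >= min_overlap:
                cluster |= terms        # the cluster is represented by the union
                break
        else:
            clusters.append(terms)
    return clusters


# Hypothetical frequent term sets.
term_sets = [
    {"due", "priced", "issuing"},
    {"priced", "issuing", "listed"},
    {"oil", "barrel"},
]
print(cluster_term_sets(term_sets, min_overlap=0.5))
```

The result of such a pass depends on the order in which the sets are presented; a hierarchical method would avoid this, at a higher cost.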

An example of the treatments on the Reuter corpus is given in Figure 1. 

Figure 1: The frequent sets processing 

Some frequent term sets (with their frequency) extracted from the 
Reuter Corpus for a support of 80 : 

{due available management priced issuing denominations payment_date} 87
{due management issuing denominations luxembourg payment_date} 81
{due management priced issuing combined paying underwriting} 80
{due management selling priced issuing listed} 81
{due priced issuing combined denominations payment_date} 80
{management issuing combined underwriting payment_date} 80
(...)

Resulting cluster:

{due available management priced issuing combined denominations listed
underwriting luxembourg payment_date paying} 45

Most frequent sequential decomposition:

(issuing due paying priced) (available denominations listed luxembourg)
(payment_date) (management underwriting combined) 41
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Prototypical documents are then all the documents (or document parts) 
that instantiate any of the extracted sequential decompositions of the fre- 
quent term set. An example of a prototypical document instantiating the 
above mentioned decomposition is shown in Figure 2. 

Figure 2: A prototypical document

<DOC 2088 >

Nissan_Motor_Co_Ltd “NSAN.T” is issuing a 35 billion_yen eurobond
due March_25 1992 paying 5-1/8_percent and priced at 103-3/8,
Nikko_Securities_Co ( Europe ) Ltd said.

The non-callable_issue is available in denominations of one_million Yen
and will be listed in Luxembourg.

The payment_date is March 25.

The selling_concession is 1-1/4_percent while management and
underwriting combined pays 5/8 percent.

Nikko said it was still completing the syndicate. 



By definition, these prototypical documents are representative of classes 
of repetitive document structures in the collection of texts. Their main 
advantage is to provide a usable interpretation scheme for the information 
extracted from the document collection in the form of frequent term sets, 
and, as such, they constitute good candidates for a partial synthesis of the 
information content hidden in a textual base. 



4. Related Work 

Several other domains concerned with Textual Data Processing (such as 
Textual Data Analysis or Content Analysis) can provide interesting insights 
on the techniques presented in this paper. 

The problem of frequent set extraction could be for instance partially re- 
lated to the identification of co-occurrent words (Lafon, 1981), repeated 
segments (Salem, 1987), or quasi-segments (Becue, 1993), often considered 
in the domain of Textual Data Analysis. The main difference here is that 
the Text Mining techniques rely on the use of frequencies of sets of words 
instead of considering co-frequencies of pairs. 

As far as more sophisticated information extraction is concerned, meth- 
ods used in Textual Data Analysis (Lebart, 1998) usually rely on a cluster 
analysis based on the chi-square distance between the lexical profiles.
For each of the resulting clusters of documents, characteristic words (La- 
fon, 1980) (i.e. words with a frequency in the cluster significantly higher 
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than the one expected according to a predefined probabilistic model) are 
then extracted. Each of the clusters is then represented by a characteristic
document, which is the document in the cluster that contains the most
characteristic words.

The differences between such approaches and prototypical document ex- 
traction as described in this paper are essentially of two kinds: (1) proto- 
typical document extraction integrates a more substantial amount of ex- 
plicit linguistic knowledge, in particular in the preprocessing phase, where 
morpho-syntactic patterns are used for the extraction of indexing terms; 
(2) the aims underlying the two methods are in fact quite different:
documents characteristic of a cluster identify the information content that is
the most discriminant for the cluster relative to the rest of the document
collection. In contrast, prototypical documents tend to identify repetitive
patterns of texts particularly frequent in the document collection, and
that will serve to structure its informational content.

The two approaches therefore appear to be rather complementary in the
sense that prototypical documents could be thought of as kinds of linguistic
frames in which the informational content (as identified by the characteristic
documents) could be preferentially expressed.

In addition, in order to allow better representativity, a more generic 
representation could be achieved by using named entity tagging, a semantic 
tagging that makes it possible to identify and generalise certain elements 
of a sentence. Such a tagging could lead to representations where the variable 
parts of the prototypical documents would be replaced by concepts. 
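
The replacement of variable parts by concepts can be sketched as follows. The 
input format (a list of token/label pairs) and the function name are 
hypothetical; they do not reproduce the output format of any particular 
tagger. 

```python
def generalise(tagged_tokens):
    """Replace tagged tokens by placeholders such as <DATE>.

    tagged_tokens is a list of (text, label) pairs, where label is None
    for tokens left untagged. Consecutive tokens carrying the same label
    are collapsed into a single placeholder.
    """
    output = []
    previous_label = None
    for text, label in tagged_tokens:
        if label is None:
            output.append(text)
        elif label != previous_label:
            output.append("<" + label + ">")
        previous_label = label
    return " ".join(output)


sentence = [
    ("The", None), ("payment", None), ("date", None), ("is", None),
    ("March", "DATE"), ("3", "DATE"), (".", None),
]
print(generalise(sentence))
```

Collapsing consecutive tokens with the same label is a simplification: two 
distinct adjacent entities of the same type would be merged. 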

For instance, on the basis of the results of named entity tagging applied 
to the document given in Figure 2 (these results were produced by the 
Alembic tool and provided by Christopher Clifton, from the MITRE NLP 
group (MITRE, 1997)), the associated document class could be represented 
by the generic prototypical document presented in Figure 3. 

Figure 3: A generic prototypical document 

<ORGANIZATION> is issuing a <NUMBER> yen eurobond due 
<DATE> paying <NUMBER> percent and priced at <NUMBER> , 
<ORGANIZATION> ( <LOCATION> ) said. 

The non-callable issue is available in denominations of <NUMBER> and 
will be listed in <LOCATION>. 

The payment date is <DATE>. 

The selling concession is <NUMBER> percent while management and un- 
derwriting combined pays <NUMBER> percent. 

<ORGANIZATION> said it was still completing the syndicate. 
5. Conclusion 

We have presented two examples of Text Mining tasks for the extraction 
of information from collections of textual data: an association extraction 
method operating on indexed documents, and a prototypical document 
extraction algorithm that can be applied on plain documents (full text 
approach). For both tasks, preliminary results have been obtained. Further 
research will be carried out to explore the use of prototypical documents 
for the automated synthesis of the information content of document classes 
in large collections of textual data. 
Multivariate and Multidimensional Data 
Analysis 



• Proximity Analysis and Multidimensional Scaling 

• Multivariate Data Analysis 

• Non Linear Data Analysis 



• Multiway Data Analysis 
Abstract: The latitude of acceptance (Sherif & Hovland, 1961; Coombs, 
1964) is considered as the cornerstone of social judgement theory and attitude 
measurement (Eagly & Chaiken, 1993). Luo (1998a) introduced a general form 
of the unidimensional unfolding models in which the latitude of acceptance is 
explicitly parameterised. This paper generalises this form further into a 
multidimensional model. It is shown that the structure of multidimensional 
unfolding models is defined by the operation function and the magnitude of the 
latitude of acceptance. It is also demonstrated that some well referenced 
multidimensional scaling models are special cases of the general form introduced 
in this paper. According to this general form, some new multidimensional 
unfolding models can be readily specified. This general form provides a 
systematic understanding of the multidimensional unfolding models, and for 
applications, it reveals the possibility of developing a general algorithm for the 
estimation of the parameters of multidimensional unfolding models. 

Key words: Unfolding model. Latitude of acceptance. Multidimensional Scaling. 



1. Introduction 

Coombs (1964) classified the procedures of data collection in attitude 
measurement into 4 types: Preference choice, Single stimulus, Stimulus 
comparison and Similarities. The single stimulus method, which is the simplest, 
requires respondents to respond positively or negatively to a stimulus (item). 
Data collected in this way are dichotomous: Agree = 1 and Disagree = 0. Various 
probabilistic unfolding models have been developed to analyse this type of data. 
These models can be classified into two classes: unidimensional and 
multidimensional. Unidimensional unfolding models parameterise item locations 
and person locations into a unidimensional real continuum (e.g., DeSarbo & 
Hoffman, 1986; Andrich, 1988; Hoijtink, 1990; Andrich & Luo, 1993; Andrich, 
1996), and multidimensional unfolding models parameterise item locations and 
person locations into a multidimensional real space (e.g., Rao, 1986; DeSarbo & 
Rao, 1986). Luo (1998a) synthesised unidimensional probabilistic unfolding 
models by introducing a general formulation for this class of unfolding models. 
This paper generalises this general formulation into multidimensional situations. 



2. The general form of multidimensional unfolding models 

Luo (1998a) introduced a general form of the unidimensional unfolding models 
as 

$$P\{x_{ni} = 1\} = \frac{\Psi(\rho)}{\Psi(\rho) + \Psi(\beta_n - \delta_i)} \qquad (1)$$

where $\Psi$, which is termed the operational function, has the following 
properties: 

(P1). Always positive: $\Psi(t) > 0$ for any real $t$; 

(P2). Monotonic in the positive domain: $\Psi(t_1) > \Psi(t_2)$ for any 
$t_1 > t_2 > 0$; and 

(P3). Symmetric about the origin: $\Psi(-t) = \Psi(t)$ for any real $t$. 

The three parameters of the model of Equation (1) are: 

$\delta_i$ : the location parameter of item $i$; 

$\beta_n$ : the location parameter of person $n = 1, \ldots, N$; and 

$\rho$ : the latitude of acceptance parameter, which can be subscripted as an 
item parameter $\rho_i$ or a person parameter $\rho_n$. 

Graphically, $\delta_i \pm \rho$ are the intersection points of the curves 
$P\{x_{ni} = 1\}$ and $P\{x_{ni} = 0\}$. The significance of Equation (1) is 
that it parameterises the latitude of acceptance, a well known but little 
researched concept (Sherif & Hovland, 1961; Coombs, 1964). 

For the multidimensional cases, which are studied in the theory of 
Multidimensional Scaling (Borg & Groenen, 1997), the item and person location 
parameters have K coordinates. 



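$$\delta_i = (\delta_{i1}, \delta_{i2}, \ldots, \delta_{iK});$$

$$\beta_n = (\beta_{n1}, \beta_{n2}, \ldots, \beta_{nK}).$$

The difference $\beta_n - \delta_i$ in Equation (1) can be replaced by the 
distance between $\beta_n$ and $\delta_i$ in a K-dimensional space 

$$\|\beta_n - \delta_i\| = \sqrt{\sum_{k=1}^{K} (\beta_{nk} - \delta_{ik})^2} \qquad (2)$$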



Then the general form for multidimensional unfolding models is 

$$P\{x_{ni} = 1\} = \frac{\Psi(\rho)}{\Psi(\rho) + \Psi(\|\beta_n - \delta_i\|)} \qquad (3)$$
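
As an illustration only, the general form (3) can be evaluated numerically. 
The sketch below is not part of the original formulation: the function names 
are hypothetical, the Euclidean distance is assumed for the person-item 
distance, and the hyperbolic cosine is used as one admissible operational 
function. 

```python
from math import cosh, sqrt


def distance(beta, delta):
    """Euclidean distance between a person point and an item point."""
    return sqrt(sum((b - d) ** 2 for b, d in zip(beta, delta)))


def unfolding_probability(beta, delta, rho, psi=cosh):
    """General form: psi(rho) / (psi(rho) + psi(distance))."""
    return psi(rho) / (psi(rho) + psi(distance(beta, delta)))


item = (0.0, 0.0)
rho = 5.0
# A person located exactly at distance rho from the item.
print(unfolding_probability((3.0, 4.0), item, rho))
# A person located at the item itself (maximum of the response function).
print(unfolding_probability((0.0, 0.0), item, rho))
```

The first call illustrates the role of the latitude of acceptance: at a 
person-item distance equal to rho the two response probabilities coincide. 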



3. Special cases of the general form 

1. The DeSarbo-Hoffman model 

DeSarbo and Hoffman (1986) introduced the Spatial Choice Model 

$$P\{x_{ni} = 1\} = \frac{1}{1 + \exp\left(\sum_{k=1}^{K} w_{nk} (\beta_{nk} - \delta_{ik})^2 + c_n\right)} \qquad (4)$$



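where $c_n$ was termed the threshold and $w_{nk}$ was called the salience or 
importance of person $n$ for dimension $k$. Equation (4) is equivalent to the 
following form 

$$P\{x_{ni} = 1\} = \frac{\exp\left(\left(\sqrt{-c_n}\right)^2\right)}{\exp\left(\left(\sqrt{-c_n}\right)^2\right) + \exp\left(\left(\sqrt{\sum_{k=1}^{K} w_{nk} (\beta_{nk} - \delta_{ik})^2}\right)^2\right)} \qquad (5)$$

Comparing equation (5) with equation (3) leads to 

$$\Psi(t) = \exp(t^2), \qquad \rho = \sqrt{-c_n},$$

where the distance in equation (3) is taken as the weighted distance 
$\sqrt{\sum_{k=1}^{K} w_{nk} (\beta_{nk} - \delta_{ik})^2}$. 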

2. The DeSarbo-Rao model 

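Let $\Phi$ be the cumulative distribution function of the normal distribution 
$N(0, \sigma^2)$. DeSarbo and Rao (1986) defined a multidimensional unfolding 
model as 

$$P\{x_{ni} = 1 \mid \beta_n, \delta_i, c_n\} = \Phi\left(c_n - \sum_{k=1}^{K} (\beta_{nk} - \delta_{ik})^2\right) \qquad (6)$$

Obviously, when $\beta_n = \delta_i$, Equation (6) has the maximum value of 
$\Phi(c_n)$; when $\sum_{k=1}^{K} (\beta_{nk} - \delta_{ik})^2 = c_n$, we have 

$$P\{x_{ni} = 1 \mid \beta_n, \delta_i, c_n\} = \Phi(0) = 0.5 = P\{x_{ni} = 0 \mid \beta_n, \delta_i, c_n\} \qquad (7)$$

That is, $\sqrt{c_n}$ is the latitude of acceptance parameter in the form of 
equation (3). 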
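To decide the expression of the corresponding operational function $\Psi$, 
equating equation (3) (with $\Psi(\rho) = 1$) to equation (6) leads to 

$$\frac{1}{1 + \Psi(\|\beta_n - \delta_i\|)} = \Phi\left(c_n - \sum_{k=1}^{K} (\beta_{nk} - \delta_{ik})^2\right) \qquad (8)$$

and 

$$\|\beta_n - \delta_i\|^2 = \sum_{k=1}^{K} (\beta_{nk} - \delta_{ik})^2 \qquad (9)$$

Therefore, 

$$\Psi(\|\beta_n - \delta_i\|) = \frac{1 - \Phi\left(c_n - \sum_{k=1}^{K} (\beta_{nk} - \delta_{ik})^2\right)}{\Phi\left(c_n - \sum_{k=1}^{K} (\beta_{nk} - \delta_{ik})^2\right)} \qquad (10)$$

Let 

$$\Psi(t) = \frac{1 - \Phi(c_n - t^2)}{\Phi(c_n - t^2)} \qquad (11)$$

then it is easy to demonstrate that the properties (P1), (P2) and (P3) are 
satisfied. The minimum point of $\Psi$ is at $t = 0$: 

$$\Psi(0) = \frac{1 - \Phi(c_n)}{\Phi(c_n)} \qquad (12)$$

Because $\Psi(\sqrt{c_n}) = 1$, the model of equation (6) can be re-expressed as 

$$P\{x_{ni} = 1\} = \frac{\Psi(\sqrt{c_n})}{\Psi(\sqrt{c_n}) + \Psi(\|\beta_n - \delta_i\|)} \qquad (13)$$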
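
The equivalence between equation (6) and the general form can be checked 
numerically. The sketch below assumes the operational function 
$\Psi(t) = (1 - \Phi(c_n - t^2)) / \Phi(c_n - t^2)$ and the latitude 
$\sqrt{c_n}$ (which requires $c_n > 0$); the function names and the numerical 
values are illustrative only. 

```python
from math import erf, sqrt


def normal_cdf(z, sigma=1.0):
    """Distribution function of N(0, sigma^2) evaluated at z."""
    return 0.5 * (1.0 + erf(z / (sigma * sqrt(2.0))))


def squared_distance(beta, delta):
    return sum((b - d) ** 2 for b, d in zip(beta, delta))


def desarbo_rao(beta, delta, c, sigma=1.0):
    """Equation (6): Phi(c_n minus the squared person-item distance)."""
    return normal_cdf(c - squared_distance(beta, delta), sigma)


def psi(t, c, sigma=1.0):
    """Assumed operational function (1 - Phi(c - t^2)) / Phi(c - t^2)."""
    phi = normal_cdf(c - t * t, sigma)
    return (1.0 - phi) / phi


def general_form(beta, delta, c, sigma=1.0):
    """General form with latitude rho = sqrt(c); requires c > 0."""
    rho = sqrt(c)
    dist = sqrt(squared_distance(beta, delta))
    return psi(rho, c, sigma) / (psi(rho, c, sigma) + psi(dist, c, sigma))


beta_n, delta_i, c_n = (0.5, -1.0), (1.0, 0.2), 1.5
print(desarbo_rao(beta_n, delta_i, c_n), general_form(beta_n, delta_i, c_n))
```

Both functions return the same probability up to rounding error, which is the 
content of the re-expression above. 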






4. Some new multidimensional unfolding models generated according to 
the general form 

1. MDHCM: Multidimensional version of the HCM (Andrich & Luo, 1993) 

$$P\{x_{ni} = 1\} = \frac{\cosh(\rho)}{\cosh(\rho) + \cosh(\|\beta_n - \delta_i\|)} \qquad (14)$$



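2. MDPARELLA: Multidimensional version of the PARELLA (Hoijtink, 
1990). 

$$P\{x_{ni} = 1\} = \frac{(\rho^2)^{\gamma}}{(\rho^2)^{\gamma} + (\|\beta_n - \delta_i\|^2)^{\gamma}} \qquad (15)$$

where $\gamma > 0$ is a structural parameter specifying the power of the 
operational function $\Psi(t) = (t^2)^{\gamma}$. 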



5. Conclusions 

The general form of multidimensional unfolding models of this paper provides a 
systematic understanding of the multidimensional unfolding models. According 
to the general formulation of multidimensional unfolding models, 
$P\{x_{ni} = 1\}$ is a single-peaked function of person-item distance in the 
multidimensional space. Therefore, all the results from unidimensional 
unfolding modelling, which is based on the single-peakedness of its response 
function, are potentially applicable to the multidimensional form. For the 
purpose of application, it reveals the possibility of developing a general 
algorithm for the estimation of the parameters of multidimensional unfolding 
models. This general formulation can also be extended to polytomous situations 
using the Rating approach or Sequential process (Luo, 1998b). 
Abstract: Optimal scaling is reviewed in the framework of a comprehensive 
multidimensional data analysis strategy, called the Data Theory Scaling Sys- 
tem (DTSS). The optimal scaling methodology has a lot of potential for the 
analysis of multivariate data that are qualitative, non-normally distributed, 
or incomplete. Also, the system can easily deal with nonlinear relationships 
between the variables to be analyzed. A crucial aspect of DTSS is the graph- 
ical display of the results of the analysis. The most recent developments of 
DTSS include the integration of optimal scaling with combinatorial data 
analysis, e.g., ordering tasks in one dimension, and the fitting of graphs to 
multivariate data. 

Keywords: multivariate analysis, categorical data, optimal scaling, uni- 
dimensional scaling, multidimensional scaling, combinatorial optimization, 
ordering, consensus, networks, graphs. 



1. Introduction 

Classical statistical methods need to be adapted in various ways to suit the 
particular characteristics of data obtained in the social and behavioral sci- 
ences. The latter are often data that are non-numerical, with measurements 
recorded on scales that have an uncertain unit of measurement. Data would 
typically consist of qualitative or categorical variables that describe the units 
(objects, subjects, individuals) in a limited number of categories. The zero 
point of these scales is uncertain, the relationships among the different cat- 
egories is often unknown, and although frequently it can be assumed that 
the categories are ordered, their mutual distances would still be unknown. 
The uncertainty in the unit of measurement is not just a matter of mea- 
surement error, because its variability may have a systematic component. 
An important development in multidimensional data analysis has been the 
optimal assignment of quantitative values to such qualitative scales (optimal 
scaling), where we distinguish, in addition to the traditional interval level: 

• the ordinal scaling level, taking only rank-orders (among categories) into 
account, and using least squares monotonic regression or monotonic regression 
splines, and 

• the nominal scaling level, taking only categorical information into account, 
and using rank=1 regression (least squares or regression splines) or the 
centroids approach (rank=p regression, where p denotes the chosen 
dimensionality in the optimization process). 

A variable is represented by a set of category points; rank=p regression 
locates a category point in the center of gravity of the associated objects; 
rank=1 regression fits category points on a straight line through the origin. 
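
The centroids approach can be illustrated with a minimal sketch: given object 
scores in p dimensions and one categorical variable, each category point is 
the center of gravity of the objects in that category. The function name and 
the data layout are hypothetical and only serve the illustration. 

```python
def category_centroids(object_scores, categories):
    """Return {category: centroid}, where each centroid is the mean of the
    p-dimensional scores of the objects that fall in that category."""
    sums = {}
    counts = {}
    for point, category in zip(object_scores, categories):
        if category not in sums:
            sums[category] = [0.0] * len(point)
            counts[category] = 0
        for k, value in enumerate(point):
            sums[category][k] += value
        counts[category] += 1
    return {
        category: [total / counts[category] for total in sums[category]]
        for category in sums
    }


scores = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]   # three objects, p = 2
variable = ["a", "a", "b"]                      # one categorical variable
print(category_centroids(scores, variable))
```

A rank=1 solution would additionally restrict these category points to a 
straight line through the origin, which is not attempted here. 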



2. Optimal Scaling: A General Data Analysis Strategy 

A review focussed on the optimal scaling paradigm needs to address its his- 
tory. The idea of optimal scaling originates with different sources. Looking 
at the history of optimal scaling with the rank=p approach, famous early 
contributions are by Horst (1935), Fisher (1940), Guttman (1941), Burt 
(1950), and Hayashi (1952). This approach to categorical data analysis 
leads eventually to (multiple) correspondence analysis (Greenacre, 1984), 
a literal translation of Benzécri’s “analyse des correspondances (multiple)”. 
The class of techniques is also known as dual scaling (Nishisato, 1980; 1994), 
and homogeneity analysis (Gifi, 1981, 1990). A recent discussion of homogeneity 
analysis from the psychometric point of view can be found in Heiser 
and Meulman (1994). A second, and perhaps the major impetus to optimal 
scaling found its origin in the area of nonmetric multidimensional scaling, 
as pioneered by Shepard (1962) and Kruskal (1964). In nonmetric multidi- 
mensional scaling, optimal scaling of the proximities is typically performed 
by a monotonic (rank=1) regression. After its remarkable accomplishments 
in MDS, optimal scaling was soon to be incorporated in techniques for mul- 
tivariate analysis, notably by Kruskal (1965), Shepard (1966), and Roskam 
(1968). In the 1970s and 1980s psychometric contributions to the area be- 
came numerous; attempts at systematization resulted in the ALSOS system 
by Young, De Leeuw and Takane (1976, 1978), Young (1981), and the system 
by the Leiden “Albert Gifi” group. The Albert Gifi (1990) book “Nonlinear 
Multivariate Analysis” aimed to provide a comprehensive system, combin- 
ing optimal scaling with multivariate analysis, including statistical develop- 
ments from the 1970s and 1980s. Since the middle 1980s, the principles of 
optimal scaling have gradually appeared in the mainstream statistical liter- 
ature (Breiman and Friedman, 1985; Gilula and Haberman, 1988; Ramsay, 
1988; Buja, 1990; Hastie, Tibshirani, and Buja, 1994). The Gifi system 
is discussed among the traditional statistical techniques in Krzanowski and 
Marriott (1994). In the 1990s, optimal scaling methods have been extended 
into a more general framework, called the “Data Theory Scaling System”; 
its mission is to meet typical concerns in the social and behavioral sciences, 
both from a substantive perspective as well as the technical point of view, 
dealing, among others, with: 

• discrete multivariate data 

• ordinal data 

• incomplete data 

• nonlinear relationships between pairs of variables 

• non-normal distributions 

• ordering/scaling of response patterns 

• social network data and other proximity relations. 

DTSS focusses on the multivariate analysis of qualitative data, including: 

• dimension reduction by linear mapping (see, e.g., Heiser and Meulman, 
1995); the Gifi system is confined to this aspect of DTSS; 
• distance approximation in multivariate data analysis (Meulman, 1992); 

• clustering of objects (Heiser, 1993; Heiser and Groenen, 1997), or 

• clustering of variables (Meulman and Verboon, 1993; Meulman, 1997); 

• graphical display of objects and variables in linear biplots as in Tucker 
(1960), and Gabriel (1971), optimized for least squares multidimensional 
scaling of multivariate data (Meulman, 1997), and 

• nonlinear biplots as in Gower and Harding (1991), generalized for least 
squares MDS (Meulman and Heiser, 1993; Groenen and Meulman, 1997); 

• the implementation of procedures in SPSS Categories (the 8.0 version 
includes categorical regression and correspondence analysis with restrictions). 

Hence, DTSS can be viewed as a successor to the Gifi system, being much 
more general. In the remainder of this paper we will discuss the two most 
recent extensions of DTSS: 

• combinatorial data analysis, and 

• fitting of graphs or networks. 



3. Finding An Optimal Ordering of Objects 

Allowing for optimal scaling of the variables in multivariate data was re- 
cently combined with the objective of finding an optimal ordering of the 
objects (Meulman, Hubert and Arabie, 1997). The latter is fundamentally a 
combinatorial problem and cannot be solved by ordinary optimization tech- 
niques used in multidimensional scaling approaches that are based on some 
form of (sub)gradient method. We are considering the analysis of an N x M 
multivariate data matrix, where N denotes the number of objects in the rows 
of the data matrix and M denotes the number of variables in the columns. 
In the study of dependence among variables, one usually resorts to some 
form of principal components analysis. The results of a principal compo- 
nent analysis have been known for quite some time to display characteristics 
of the objects as well (Gower, 1966), and this is in the form of distances 
that approximate proximities that would be observed if the Euclidean dis- 
tances among the rows of the data matrix were inspected. In Meulman 
(1992), a distance-based PCA was defined through the minimization of a 
least squares loss function, defined on the distances, and using majorization. 
In addition, optimal scaling of the variables was included in the procedure. 
This approach to optimal scaling is basically different from nonmetric MDS, 
because the latter involves optimal scaling of the proximities instead of the 
variables (from which the proximities are derived). If we wish to find a single 
ordering of the objects, the objective specializes to a one-dimensional scaling 
task. The associated least squares loss function can be written as 

$$\mathrm{STRESS}(Q; x; c) = \|D(Q) + c - D(x)\|^2. \qquad (1)$$

Here an element of D(·) is a Euclidean distance between two objects; Q denotes 
the multivariate data matrix, where the columns are separately scaled 
according to the chosen level and method. The optimal scaling of the vari- 
ables is attained by the use of constrained majorization (as in Meulman, 
1992). The vector x gives a one-dimensional scale, with distances as close 
as possible to the proximities in D(Q); c is a constant. Given a fixed 
permutation, finding c and x can be done as in Hubert, Arabie and Meulman 
(1997), i.e., by an iterative projection approach, adopted from Dykstra 
(1983). Finding the optimal ordering of the objects in x is a combinatorial 
problem. Since gradient or majorization methods have severe local min- 
ima problems when dealing with this task, an alternative strategy was used, 
shown to be successful in Hubert and Arabie (1994), that involves a number 
of local operations. These are: (a) all pairwise interchanges of objects; (b) 
all insertions of k consecutive objects between any two existing objects or 
at the beginning and end of the permutation; (c) all complete reversals of k 
consecutive objects. These operations are cycled through repeatedly until no 
change occurs. The combination of optimal scaling of variables with optimal 
ordering of objects can give a very parsimonious representation of a (large) 
number of qualitative variables. A very suitable application would be the 
situation where different persons have judged different options according to 
a number of different criteria. In the analysis, persons would function as 
the variables, and the result would be an optimal consensus ordering of the 
options. Since distance-based MVA can now be applied to problems that 
are basically combinatorial in nature, we can also consider other represen- 
tations, by analogy with Hubert, Arabie and Meulman (1997). These tasks 
include circular unidimensional structures, additive trees, and ultrametrics. 
These applications are forthcoming. 
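
The local operations (a), (b) and (c) described above can be sketched as a 
simple search over permutations. This is an illustration under simplifying 
assumptions, not the procedure of the cited papers: the constant c is fixed at 
zero, the coordinates for a fixed order are obtained from the closed-form 
least squares solution instead of the iterative projection approach, optimal 
scaling of the variables is left out, and all function names are hypothetical. 

```python
def ls_coordinates(prox, order):
    """Least squares coordinates (centered at zero) for a fixed object order:
    (sum of proximities to earlier objects - sum to later objects) / n."""
    n = len(order)
    coords = []
    for pos, obj in enumerate(order):
        before = sum(prox[obj][order[q]] for q in range(pos))
        after = sum(prox[obj][order[q]] for q in range(pos + 1, n))
        coords.append((before - after) / n)
    return coords


def stress(prox, order):
    """Sum of squared differences between proximities and scale distances."""
    x = ls_coordinates(prox, order)
    n = len(order)
    return sum(
        (prox[order[i]][order[j]] - (x[j] - x[i])) ** 2
        for i in range(n) for j in range(i + 1, n)
    )


def neighbours(order, max_block):
    """Generate the permutations reachable by operations (a), (b) and (c)."""
    n = len(order)
    for i in range(n):                       # (a) pairwise interchanges
        for j in range(i + 1, n):
            cand = list(order)
            cand[i], cand[j] = cand[j], cand[i]
            yield cand
    for k in range(1, max_block + 1):
        for start in range(n - k + 1):
            block = order[start:start + k]
            rest = order[:start] + order[start + k:]
            for pos in range(len(rest) + 1):  # (b) insertions of k objects
                if pos != start:
                    yield rest[:pos] + block + rest[pos:]
            if k > 1:                         # (c) reversals of k objects
                yield order[:start] + block[::-1] + order[start + k:]


def local_search(prox, order, max_block=2):
    """Cycle through the local operations until no change improves the loss."""
    best = list(order)
    best_loss = stress(prox, best)
    improved = True
    while improved:
        improved = False
        for cand in neighbours(best, max_block):
            loss = stress(prox, cand)
            if loss < best_loss - 1e-12:
                best, best_loss, improved = cand, loss, True
                break
    return best, best_loss


points = [0.0, 1.0, 3.0, 6.0, 10.0]
prox = [[abs(a - b) for b in points] for a in points]
start = [2, 0, 4, 1, 3]
print(local_search(prox, start))
```

The search accepts the first improving move and restarts; like any local 
strategy of this kind, it is only guaranteed to stop at a local optimum. 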

4. Fitting Graphs to Multivariate Data 

Another, and not completely unrelated, recent development in DTSS is the 
incorporation of spatial representations other than Euclidean, specifically 
through graphs (networks). In Heiser and Meulman (1997), a prevalent 
type of categorical multivariate data was considered, consisting of binary 
variables. An example of the latter in the domain of psychometrics is the 
collection of scores of examinees on binary items. The aim of the analysis 
would be to order the N examinees on a latent one-dimensional continuum 
(scale), to be constructed from the examinees’ responses on the M binary 
items. This objective can more generally be formulated as to form a partially 
ordered structure among the response profiles. A second example, from nu- 
merical taxonomy, consists of presence/absence scores of M attributes in 
N individuals, where the major task is to construct hierarchical trees. For 
binary data, the information in the N x M multivariate data matrix, can 
be efficiently transformed into a (reduced) profile frequency matrix, where 
a weight is attached to each profile corresponding to the occurrence of the 
row profile in the N x M data matrix. In traditional statistics, binary data 
are very often presented in the form of a multiway contingency table; next, 
models are fitted to the cell counts, with respect to the marginals of the ta- 
ble. Currently, the most popular method to analyze this type of categorical 
data is loglinear analysis. An alternative analysis could be a multiple corre- 
spondence analysis of the profile frequency matrix. Geometrically, the row 
profiles can be regarded as the $2^M$ corners of a hypercube, where each corner 
of the hypercube is associated with a mass corresponding to the occurrence 
of a profile. Multiple correspondence analysis then aims at a simplified rep- 
resentation of the set of profiles. In contrast to the prevailing belief that 
loglinear analysis is radically different from multiple correspondence analy- 
sis because the latter would ignore higher-order interactions, Meulman and 
Heiser (1997) have shown that distances in the graphical display in multiple 
correspondence analysis are inverse functions of the odds ratios that 
express higher-order interactions. The scaling approach presented in Heiser 
and Meulman (1997) is a very different spatial alternative to loglinear anal- 
ysis. In contrast to the graphical models, as reviewed in Whittaker (1990), 
focussing on the representation of variables, the objective is to develop a 
graph model for the response patterns, i.e., for a group of objects having 
the particular profile. The cell frequencies in the multiway contingency ta- 
ble are associated with distances between vertices in a graph. The Heiser 
and Meulman (1997) method is based on the equivalence between the Ham- 
ming distance among the corner points of the hypercube (which is a distance 
of the city-block type computed from binary coordinates) and the shortest 
path-length distance in a specific type of graph. In the Heiser and Meul- 
man (1997) paper, a weighted least squares method was developed to fit the 
Hamming distances in the graph to the Hamming distances derived from 
the data, with weights composed from the pairwise product of the associ- 
ated profile frequencies. Thus, the following least squares loss function was 
minimized: 



$$\sigma(B) = \sum_{i<j}^{2^M} p_i p_j \left( d_{ij}(A) - d_{ij}(B) \right)^2 \qquad (2)$$

where $p_i$ and $p_j$ are the profile frequencies of profiles $a_i$ and $a_j$, 
respectively, the $d_{ij}(A)$ are the distances between the profiles, and the 
$d_{ij}(B)$ are the distances in the graph. An important feature of the 
Hamming distance is the additivity of distances along the shortest path in the 
graph (Heiser, 1997). Additivity of distances induces the feature of 
“betweenness”: the distance between two points $b_i$ and $b_j$ is equal to the 
sum of the distances from $b_i$ and from $b_j$ to an intermediate point $b_k$. 
Apart from selecting a best fitting 
graph, the Heiser and Meulman (1997) method was extended with a clus- 
tering feature (see Heiser and Meulman, 1998). When many corners of the 
hypercube represent only a few objects (profiles with low frequencies), it is 
useful to allow profile points to be clustered in the representation. There- 
fore, the observed profiles have to be allocated to a limited number (K) of 
vertices in the graph. It can be shown that the resulting multidimensional 
scaling problem that forms the core of the method is now only of order 
K, rather than N, as in ordinary multidimensional scaling of multivariate 
data, or $2^M$, the number of vertices in the hypercube, which is equal to the 
maximum number of rows in the profile frequency matrix (as in Groenen, 
Commandeur, and Meulman, 1997). 
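
The reduction of binary data to a profile frequency matrix and the weighted 
least squares loss can be sketched as follows. The sketch only evaluates the 
loss for graph distances supplied by the caller; the search for a best fitting 
graph and the clustering feature are not attempted, and the function names are 
hypothetical. 

```python
from collections import Counter


def profile_frequencies(data):
    """Reduce an N x M binary matrix to its distinct row profiles and the
    number of times each profile occurs."""
    counts = Counter(tuple(row) for row in data)
    profiles = sorted(counts)
    return profiles, [counts[p] for p in profiles]


def hamming(a, b):
    """Number of coordinates in which two binary profiles differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def weighted_loss(profiles, freqs, graph_dist):
    """Sum over profile pairs of p_i * p_j * (Hamming - graph distance)^2."""
    loss = 0.0
    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            d_data = hamming(profiles[i], profiles[j])
            loss += freqs[i] * freqs[j] * (d_data - graph_dist[i][j]) ** 2
    return loss


data = [[0, 0], [0, 1], [0, 1], [1, 1]]          # N = 4 objects, M = 2 items
profiles, freqs = profile_frequencies(data)      # three distinct profiles
path = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]         # path graph: perfect fit
complete = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]     # complete graph: misfit
print(weighted_loss(profiles, freqs, path))
print(weighted_loss(profiles, freqs, complete))
```

In the toy example the path graph reproduces the Hamming distances exactly, 
whereas the complete graph shortens the distance between the two extreme 
profiles and is penalized accordingly. 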
5. Discussion 

The integration of the optimal scaling paradigm in state-of-the-art 
multidimensional data analysis offers a wealth of exciting new methods to 
explore qualitative multivariate data that go beyond the “nonlinear multivariate 
analysis” techniques as advocated in Gifi (1990). Specifically, the matching 
of optimal scaling with combinatorial data analysis, and the fitting of graphs 
or networks opens new avenues for modeling systematic relationships among 
the individuals or objects in data from the social and behavioral sciences. 
Abstract: New identification and similarity models are derived from a unified 
viewpoint based on the maximum information entropy principle. Relationships 
among the new models and some existing models, especially Ashby's general 
recognition theory (GRT) model, are also clarified. 
1. Introduction 

Models for stimulus identification data have been studied and developed as one 
of the most important and attractive problems in psychology, especially in the 
fields of mathematical psychology and psychometrics. This is perhaps because 
stimulus identification is closely related to other basic psychological concepts 
such as similarity, categorization, preference, and signal detection. 

There are at least two types of identification models incorporating 
multidimensional representation; the MDS choice model (Nosofsky, 1991; 
Takane & Shibayama, 1992) and the GRT (general recognition theory) 
identification model (Ashby & Perrin, 1988). I recently developed an 
identification model called the MIP (maximum information principle) 
identification model (Miyano, 1997), and discussed two issues concerning these 
two models. One was the assumption of the choice rule employed in the MDS 
choice model. The other was the relationship between the MDS choice 
model and the GRT identification model. As a result, it was shown that the 
MIP model could derive the MDS choice model without assuming Luce’s 
choice rule, and a simple relation between the MDS choice and GRT 
identification models was clarified. 

As noted above, the concept of identification is closely related to similarity, 
since it is often assumed that the greater the similarity between a pair of 
stimuli, the more likely one will be confused with the other (e.g., Luce, 1963; 
Tversky & Gati, 1982). Although this does not mean that similarity 
(confusability) is the only factor affecting confusability (similarity), it seems 
reasonable to derive a similarity model from an identification model, as shown 
by Ashby & Perrin (1988). 

In this paper, we first give a brief summary of the MIP identification model, 
and then discuss how it can be used to explain perceived 
similarity. 



2. Identification Model 



Consider the case of a complete stimulus identification task, where subjects are 
asked to identify given stimuli. Let $S_i$, $i = 1, 2, \ldots, m$, be a stimulus, 
$R_i$ be a unique response corresponding to $S_i$, and $P(R_i \mid S_j)$ be the 
conditional probability with which stimulus $S_j$ is identified as stimulus 
$S_i$. According to the MDS choice model, $P(R_i \mid S_j)$ is given by 

$$P(R_i \mid S_j) = \frac{w_i \exp(-d_{ji}^{\alpha})}{\sum_{k=1}^{m} w_k \exp(-d_{jk}^{\alpha})} \qquad (1)$$

where $d_{ji}$ is the distance between $S_j$ and $S_i$ in some multidimensional 
psychological space, $w_i$ is the bias associated with $R_i$, and $\alpha$ is a 
positive parameter. The MDS choice model is a multidimensional representation 
of the so-called similarity choice model based on Luce’s choice rule. 

In the GRT identification model, $P(R_i \mid S_j)$ is given by 

$$P(R_i \mid S_j) = \int_{r_i} f_j(x)\,dx \qquad (2)$$

where $f_j$ is the probability density function of $S_j$ being represented as 
$x$ in the $r$-dimensional space $U \subset R^r$, and $r_i$ is the spatial region 
in $U$ associated with response $R_i$. The GRT model is a multidimensional 
extension of the Thurstone model, and outperforms the MDS choice model, while 
the latter shows considerably good performance for some identification data 
(Nosofsky, 1991). 

As shown in this section, the MIP identification model gives a simple 
explanation of the relation between the MDS choice and GRT models. The 
MIP model postulates that the stimulus identification process is a two-stage 
process: the first stage is the transduction process, which transduces external 
information into a mental representation, and the second is the decision 
process, which determines the response to the mental representation. 

The transduction process is modeled under the following two assumptions. 

(A-1). The percept of any single presentation of a stimulus can be 
represented as a point $x$ in the $r$-dimensional space $U \subset R^r$. 

(A-2). The probability of stimulus $S_j$ being represented as $x$ in $U$ is 
determined by a penalty function $\theta_j(x - x_j)$ and the maximum information 
entropy principle subject to the constraint 

$$\int_U p(x \mid S_j)\, \theta_j(x - x_j)\,dx = k_j \qquad (3)$$

where $k_j$ is a constant inherent to stimulus $S_j$. The penalty function 
$\theta_j(x)$ is a non-negative convex function which satisfies (1) 
$\theta_j(x) = 0$ iff $x = 0$, and (2) $\theta_j(x) = \theta_j(-x)$. 

Based on these assumptions, we can derive the probability of $S_j$ being 
represented as $x$ in $U$. Let $p(x \mid S_j)$ be the probability density of $x$ 
when $S_j$ is presented. Then, our problem is to find $p(x \mid S_j)$ which 
maximizes the information entropy $\psi$, 

$$\psi = -\int_U p(x \mid S_j) \log p(x \mid S_j)\,dx \qquad (4)$$

subject to the constraint (3). 

This problem can be solved easily by using the Lagrange method; that is, we 
obtain 

$$p(x \mid S_j) = \frac{\exp(-\beta_j \theta_j(x - x_j))}{\int_U \exp(-\beta_j \theta_j(x - x_j))\,dx} \qquad (5)$$

where $\beta_j$ is a Lagrange parameter determined by (3). 

Likewise, the decision process can be modeled by using the maximum 
information entropy principle. That is, the response probability 
$P(R_i \mid x)$ can be derived by maximizing the entropy, 

$$-\sum_{i=1}^{m} P(R_i \mid x) \log P(R_i \mid x) \qquad (6)$$

under the constraint 

$$\sum_{i=1}^{m} P(R_i \mid x) \left( \phi_i(x - x_i) - \lambda \log b_i \right) = \text{constant} \qquad (7)$$

where $b_i$ is the response bias for $R_i$ and $\lambda$ is a weighting parameter 
to adjust the effects of the biases; that is, if we assume $\mu$ is a Lagrange 
parameter corresponding to (7), then we obtain 

$$P(R_i \mid x) = \frac{b_i^{\lambda\mu} \exp(-\mu \phi_i(x - x_i))}{\sum_{k=1}^{m} b_k^{\lambda\mu} \exp(-\mu \phi_k(x - x_k))} \qquad (8)$$

where $\phi_i(x - x_i) = \beta_i \theta_i(x - x_i)$. 

The constraint (7), successfully introduced by Buhmann et al. (1993) in 
their clustering algorithm, controls the randomness of the response to $x$. From 
(5) and (8), the probability of $R_i$ when $S_j$ is presented is given by 

$$P(R_i \mid S_j) = \frac{1}{C_j} \int_U P(R_i \mid x) \exp(-\beta_j \theta_j(x - x_j))\,dx \qquad (9)$$

where $C_j$ is a normalization term defined by 

$$C_j = \int_U \exp(-\beta_j \theta_j(x - x_j))\,dx \qquad (10)$$
It is easy to see that at a very large $\beta_j$, 
$p(x \mid S_j) = \delta(x - x_j)$, and at a very large $\mu$, the response becomes 
deterministic; $P(R_i \mid x) = 1$ if for any $k \neq i$ 
$\phi_i(x - x_i) - \lambda \log b_i < \phi_k(x - x_k) - \lambda \log b_k$, and 
$P(R_i \mid x) = 0$ otherwise. The latter property of $P(R_i \mid x)$ gives the 
same decision bound as that of the GRT choice model defined by the likelihood 
equation $p(x \mid S_i) / p(x \mid S_k) = 1$. In fact, if we assume that 
$C_1 = C_2 = \cdots = C_m$ and $b_1 = b_2 = \cdots = b_m$, then the decision bound 
defined by $\phi_i(x - x_i) - \lambda \log b_i < \phi_k(x - x_k) - \lambda \log b_k$ 
coincides with that of the GRT choice model. 

Hence, we can assert the following properties. 

(1) If $\mu$ is very large, then the MIP identification model is equivalent to 
the GRT identification model. 

(2) If $\beta_j$ is very large, then the MIP identification model is equivalent 
to the MDS identification model. 

(3) The Shepard-Luce choice rule and the exponential similarity function are 
not indispensable assumptions of the MDS choice model. 
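
The two-stage structure of the MIP identification model can be made concrete 
with a small numerical sketch for the unidimensional case. Several choices 
below are assumptions made for illustration and are not taken from the paper: 
the penalty functions are quadratic with a common parameter, so that the 
transduction density is proportional to $\exp(-\beta (x - x_j)^2)$; the 
response rule is taken to be a normalized exponential of the biased penalties; 
the integral over $x$ is replaced by a sum over a finite grid; and all names 
are hypothetical. 

```python
from math import exp, log


def mip_identification(means, biases, beta, mu, lam=1.0,
                       lo=-5.0, hi=7.0, steps=1201):
    """Return table[j][i], a grid approximation of P(R_i | S_j)."""
    m = len(means)
    grid = [lo + (hi - lo) * t / (steps - 1) for t in range(steps)]

    def phi(x, i):
        # Assumed quadratic penalty: phi_i(x - x_i) = beta * (x - x_i)^2.
        return beta * (x - means[i]) ** 2

    def response(x):
        # Decision stage: normalized exponential of the biased penalties,
        # computed with a shifted exponent for numerical stability.
        scores = [lam * mu * log(biases[i]) - mu * phi(x, i) for i in range(m)]
        top = max(scores)
        weights = [exp(s - top) for s in scores]
        total = sum(weights)
        return [w / total for w in weights]

    table = []
    for j in range(m):
        # Transduction stage: unnormalized density of x given stimulus j.
        density = [exp(-phi(x, j)) for x in grid]
        norm = sum(density)          # plays the role of C_j on the grid
        row = [0.0] * m
        for x, w in zip(grid, density):
            probs = response(x)
            for i in range(m):
                row[i] += w * probs[i] / norm
        table.append(row)
    return table


table = mip_identification([0.0, 1.0, 2.0], [1.0, 1.0, 1.0], beta=4.0, mu=2.0)
for row in table:
    print([round(p, 3) for p in row])
```

Each row of the resulting table is a probability distribution over the 
responses; increasing mu in this sketch sharpens the response rule towards a 
deterministic decision bound. 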



3. Similarity Model 



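As assumed by Ashby et al. (1988), we employ the assumption that perceived 
similarity is proportional to the probability of a confusion. This implies that 
the perceived similarity $s(S_a, S_b)$ of $S_a$ to $S_b$ is defined by 

$$s(S_a, S_b) = \frac{1}{C_a} \int_U \exp(-\beta_a \theta_a(x - x_a))\, u(x; x_a, x_b, \mu)\,dx \qquad (11)$$

where 

$$u(x; x_a, x_b, \mu) = \frac{b_b^{\lambda\mu} \exp(-\mu \phi_b(x - x_b))}{b_b^{\lambda\mu} \exp(-\mu \phi_b(x - x_b)) + b_a^{\lambda\mu} \exp(-\mu \phi_a(x - x_a))} \qquad (12)$$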



This model corresponds to the similarity model proposed by Ashby et al., an 
extended version of the GRT similarity model (see Appendix A of Ashby & 
Perrin, 1988). Suppose, for convenience, that the perceptual mean of $S_a$ is 
greater than the mean of $S_b$; then their model for the unidimensional case is 



given by 






(13) 



where f_a(x) (f_b(x)) is the probability distribution of the perceptual effects
elicited by S_a (S_b), x_0 is a point at which the likelihood ratio f_a/f_b is
1.0, and g is a smooth continuous function such as a normal cumulative
distribution with mean x_0 and variance \sigma^2. Note that if the S_a mean is
smaller than the S_b mean, then g must be replaced in (13) by 1 - g.







For comparison, let us consider the unidimensional case of (11), and assume 
that the penalty functions are quadratic; that is, 

^bf Then, u{x;x^,x^J) is given by 



u{x;x^,Xf^,^) = 



1 



(14) 



^+Kb^^pi-^d^bi^-xo)y 

where k^^ = {bjb^f^, d^b ^'^{^a~^b)^ and x^ ={x^+ x^)!2. From this 



result, we may expect that the MIP similarity model has the same properties
as the extended GRT similarity model, since the shape of the logistic function
is very similar to that of the normal cumulative distribution function.
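The closeness of the logistic curve to the normal cumulative distribution can be checked numerically. The following is a minimal sketch, not part of the original paper: the rescaling constant 1.702 is the conventional value used to match the two curves, and the grid of evaluation points is an arbitrary choice made for illustration.

```python
import math

def logistic(z):
    """Standard logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Conventional rescaling constant (an assumption of this sketch,
# not a quantity taken from the paper).
SCALE = 1.702

# Largest absolute gap between the two curves on a fine grid over [-6, 6].
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(logistic(SCALE * z) - normal_cdf(z)) for z in grid)
print(max_gap)
```

With this rescaling the two curves differ by less than 0.01 everywhere on the grid, which is the sense in which a logistic u can stand in for a normal-ogive g.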

Hence, we can assert that our similarity model is a natural realization of the 
above GRT similarity model, and p acts like the unit of measurement which 
will become small when subjects judge the similarities among stimuli never 
being confused. 



4. Conclusions 

In this paper, we have shown that the MDS choice model can be derived
without assuming Luce's choice rule or an exponential similarity function,
and that there exists a close relationship between the MDS choice model and
the GRT identification model. It has also been demonstrated that our MIP
identification model gives a natural extension of the GRT similarity model.

While the general recognition theory is well known as a very powerful and 
general theory of recognition, we believe that our approach based on the 
maximum information entropy at least will contribute to the further 
development of that theory. 
Abstract: In this paper we suggest a non parametric generalization of the
Mahalanobis distance which takes into account the differing spread
of the data in the different directions. The output is an easy to handle metric
which can be conveniently used both in an exploratory stage of the analysis for
the detection of multivariate outliers and subsequently as a tool for non
parametric discriminant analysis, multidimensional scaling and cluster analysis.
In addition, the use of this metric can provide information about multivariate
transformations and multiple outliers.

Keywords: Mahalanobis distance, equidistance contours, convex hull peeling,
B-spline smoothing.

1 Introduction 

The distance concept plays an important role in many topics of multivariate
analysis. One of the most widely used distance measures is that of
Mahalanobis (d_M) (e.g. Dasgupta, 1993). This distance is appropriate for use
in sample spaces where there exist differential variances and correlations
between variables. In this metric, contours of equidistance in the p-dimensional
space are p-dimensional hyperellipsoids. Consequently, when we use d_M we
implicitly assume that the spread of the data in the different directions is
symmetric. Therefore, in the presence of highly asymmetric data the use of this
distance seems questionable. More generally, in a preliminary stage of the
analysis it seems preferable to use a metric which does not assume an
underlying distribution. Finally, even if the ellipticity hypothesis is satisfied,
in order to construct d_M we still have to face the problem of estimating the
means of the variables and the covariance matrix. The breakdown point of
d_M is 0 and the presence of multiple outliers can cause masking and swamping
problems (e.g. Barnett and Lewis, 1994).

The purpose of this paper is to suggest a simple generalization of d_M which:
(a) is non parametric, (b) is robust to the presence of atypical observations, (c)
takes into account the differing spread of the data in the different directions,
(d) reduces to the usual d_M when the ellipticity hypothesis is satisfied and the
data are not contaminated.

^The program for computing the non parametric distance suggested in this paper is
available upon request. We can be contacted at statdue@ipruniv.cce.unipr.it or
zani@ipruniv.cce.unipr.it. The authors wish to thank Aldo Corbellini for
programming help.
2 Description of the method

First let us consider the case in which the number of variables (p) is equal
to 2. Our approach starts by defining a non parametric bivariate central region
and a robust centroid. As pointed out by Riani et al. (1998), a natural and
completely non parametric way of defining a central region in R^2 is through
the use of the so-called convex hull peeling. Successive convex hulls are peeled
until the first one is obtained which includes not more than 50% of the data
(and so asymptotically half of the data). This 50%-hull is smoothed using a
B-spline, constructed from cubic polynomial pieces, which uses the vertices of
the 50%-hull to provide information about the location of the knots. (From
now on this spline will be called the 50%-spline.) Zani et al. (1998) discuss
several choices of a robust centroid. In this work we use the intersection
of the two least squares lines built with the observations which lie inside
or at the boundary of the 50%-spline. As an illustration of the suggested
approach let us consider the data on 160 European regions reported in
Figure 1. On the x-axis we have the Index Numbers of GNP per inhabitant in
PPS (EUR 15=100). On the y-axis we have the unemployment rate in %
(Source: Eurostat, REGIO, 1996). Figure 1 also reports the 50%-spline. As
emerges clearly from the plot, the spread of the data differs across
directions, therefore the traditional approach based on d_M does not seem to
be appropriate. Two straight lines passing through the bivariate centroid
have also been drawn. Zani et al. (1998) show that in order to obtain an outer
contour which under the assumption of bivariate normality leaves outside a
percentage of observations close to 1%, we must multiply the distance from the
50%-spline to the robust center by 1.68. Using this coefficient we obtained
the outermost contour reported in Figure 1. The units which lie outside
the bivariate contour can be interpreted as atypical. The 50%-spline and the
outermost contour can be interpreted as non parametric equidistance contours
from the center. In this way we can take into account non parametrically
the differing spread of the data. This leads us to define a new metric. Let
us consider separately the distance of one point from the centroid and the
distance between two generic points.

Figure 1: Bivariate boxplot of Unemployment rate in % versus Index Numbers
of GNP per inhabitant for 160 European regions (50% and 99% contours).
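The convex hull peeling step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the hull routine is the standard monotone-chain algorithm, only hull vertices are stripped at each pass, the B-spline smoothing of the 50%-hull is not reproduced, and the toy 5 x 5 grid is hypothetical data rather than the European regions data set.

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Monotone-chain convex hull; returns the vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peel_to_half(points):
    """Strip successive hulls until at most 50% of the points remain.

    Returns the surviving points and the vertices of their hull
    (the '50%-hull' of the text, before any spline smoothing).
    """
    remaining = list(points)
    target = len(points) / 2.0
    while len(remaining) > target:
        hull = set(convex_hull(remaining))
        remaining = [p for p in remaining if p not in hull]
    return remaining, convex_hull(remaining)

# Hypothetical toy data: a 5 x 5 integer grid.
toy = [(i, j) for i in range(5) for j in range(5)]
inner, hull50 = peel_to_half(toy)
print(len(inner), hull50)
```

On the toy grid three peels are needed, leaving the inner 3 x 3 block, whose hull is the square with corners (1,1), (3,1), (3,3) and (1,3).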

Distance from the centroid in R^2

In our metric the observations which lie on the 50%-spline have the same
distance from the centroid. In order to define a unit of measure, without loss of
generality we can set equal to 1 the distance from the center of a point which
lies on the 50%-spline. The distance of every other observation can be based
on this unit of measure. For example, in Figure 2 let us consider the
straight line OA passing through the robust centroid O and point A. Let point
A' be the intersection of segment OA with the 50%-spline. The distance OA
in this case is 2.55 times the distance OA'. Therefore in our metric point A lies
at a distance 2.55 from the origin. In general, the distance from the centroid
of every point K is given by the ratio between OK and OH, where H is the
intersection of the straight line passing through OK with the 50%-spline. In
Figure 2 units A and B respectively correspond to the regions Extremadura
(E) and Uusimaa (FIN). Note that using the standardized Euclidean distance
point B would be much closer to the center than point A. Using our metric
the ratio OB/OB' is 2.50. This implies that if we take into account the
differing spread of the data in the different directions these two regions have
approximately the same distance. Following similar arguments we can claim
that point C (Ceuta y Melilla, E) has a distance from the centroid (3.32)
which is exactly equal to that of point D (Bremen, D).

Figure 2: Example of computation of the distance from the robust centroid
for 4 of the European regions reported in Figure 1 (Standardized variables;
x-axis: GNP per inhabitant (EUR 15=100)).

Figure 3: Plot of the points of Figure 2 in the transformed space using the
suggested metric (The two superimposed circumferences define equidistant
contours from the robust centroid; x-axis: GNP per inhabitant (EUR 15=100)).

In Table 1 we compare for units A to D our non parametric distance
(d_RZ) from the centroid with the Euclidean one (d) (using standardized
variables) and that of Mahalanobis (d_M). This table shows that for units A
and C the values of d_RZ and d_M are smaller than those produced by the
Euclidean distance. The opposite happens for units B and D. In fact d_RZ and
d_M correctly take into account the negative correlation (r = -0.359) between
the two variables. The values of our metric compared to d_M are smaller for
units A and C and greater for the two remaining units. Our metric is based
on non parametric contours and makes it possible to reduce (increase) the
distance for the units located in the directions where the spread of the data
is high (low). The assumption of symmetric spread underlying d_M is usually not
met in practice. For example, in the data reported in Figure 1 it is clear
that the spread of the data in the South-West direction is much more evident
than that in the North-East direction. The ellipses (not reported here for
lack of space) which represent the equidistance contours in the Mahalanobis
metric treat the units located South-West in the same way as those located
North-East. This explains why, for example, the value of d_RZ for unit D is
much bigger than that reported by d_M.

Remark 1: If the underlying hypothesis of elliptic distribution is true the
50%-spline tends to become an ellipse (Atkinson and Riani, 1997). In addition,
convex hull peeling is invariant under linear transformations of the data.
Table 1: Squared distances from the centroid of four regions using different
metrics.

Region               d_RZ^2    d_M^2      d^2
A: Extremadura        6.528   11.070   13.041
B: Uusimaa            6.275    2.064    1.331
C: Ceuta y Melilla   11.050   14.139   14.961
D: Bremen            11.038    5.573    4.935


Therefore the suggested metric simply reduces to d_M when the ellipticity
hypothesis is satisfied and the data are not contaminated.

Remark 2: In order to calculate the former distances, in our metric we simply
need an estimate of the centroid but we do not have to estimate either
dispersion or correlation parameters.

Remark 3: It is possible to use a different spline contour (for example a
60%-spline or 75%-spline). With this choice the shape of the inner region may
adapt more to the spread of the units lying far from the centroid. However
this results in a decrease of robustness.

Figure 3 shows the 160 European regions of Figure 1 after eliminating non
parametrically the differing spread of the data in the various directions. In
this new space both unit C and D lie on the same circumference centered
on the robust centroid whose radius is 3.32. Similarly, units A and B lie
approximately on the circumference whose radius is 2.5.

Distance between two points in R^2

In our metric the splines which define the equidistant contours are transformed
into circles. This implies, for example, that the units which have a distance
equal to 3.32 from the robust center lie, after the transformation, on a circle
with radius 3.32. In the transformed space (Figure 3) Euclidean distances
referring to different directions are directly comparable because we have
removed (non parametrically) the different spread of the data in the various
directions. Therefore our metric is equal to the Euclidean distance between
two points in the transformed space.
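The two definitions above (distance from the centroid as the ratio OK/OH, and distance between two points as the Euclidean distance after the contour has been mapped to a circle) can be sketched as follows. This is an illustrative reading, not the authors' program: the contour is approximated by a closed polygon assumed to be star-shaped around the centroid, each point is assumed to keep its direction from the centroid in the transformed space, and the rectangular contour and the points A and B are hypothetical.

```python
import math

def cross(u, v):
    """2-D cross product of vectors u and v."""
    return u[0] * v[1] - u[1] * v[0]

def radial_distance(point, centroid, contour):
    """Ratio OK/OH, H being where the ray from O through K meets the contour."""
    d = (point[0] - centroid[0], point[1] - centroid[1])
    if d[0] == 0 and d[1] == 0:
        return 0.0
    best_t = None
    n = len(contour)
    for k in range(n):
        a, b = contour[k], contour[(k + 1) % n]
        e = (b[0] - a[0], b[1] - a[1])
        w = (a[0] - centroid[0], a[1] - centroid[1])
        denom = cross(d, e)
        if abs(denom) < 1e-12:      # ray parallel to this edge
            continue
        t = cross(w, e) / denom     # position along the ray (H = O + t * OK)
        s = cross(w, d) / denom     # position along the edge, in [0, 1]
        if t > 0 and 0.0 <= s <= 1.0 and (best_t is None or t < best_t):
            best_t = t
    if best_t is None:
        raise ValueError("ray does not cross the contour")
    return 1.0 / best_t             # OK / OH = 1 / t

def to_transformed_space(point, centroid, contour):
    """Keep the direction from O, replace the length by the radial distance."""
    d = (point[0] - centroid[0], point[1] - centroid[1])
    norm = math.hypot(d[0], d[1])
    if norm == 0:
        return (0.0, 0.0)
    r = radial_distance(point, centroid, contour)
    return (r * d[0] / norm, r * d[1] / norm)

def contour_metric(p, q, centroid, contour):
    """Euclidean distance between p and q in the transformed space."""
    tp = to_transformed_space(p, centroid, contour)
    tq = to_transformed_space(q, centroid, contour)
    return math.hypot(tp[0] - tq[0], tp[1] - tq[1])

# Hypothetical contour: spread along x is twice the spread along y.
contour = [(2, -1), (2, 1), (-2, 1), (-2, -1)]
O, A, B = (0, 0), (4, 0), (0, 2)
dist_A = radial_distance(A, O, contour)
dist_B = radial_distance(B, O, contour)
dist_AB = contour_metric(A, B, O, contour)
print(dist_A, dist_B, dist_AB)
```

In this toy case A is twice as far from O as B in Euclidean terms, yet both lie at distance 2 in the contour-based metric, because the contour extends twice as far in the direction of A.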

Extensions to p-dimensional data 

In the presence of p-dimensional data the situation becomes more complicated
because we cannot rely on the graphical representation anymore. With the
purpose of defining a distance measure which, with well behaved data, reduces
to d_M, we have to consider pairs of variables such that the sum of the
marginal bivariate d_M is equal to the overall d_M using the whole set of p
variables. It is possible to show that the squared p-variate d_M can be
expressed as the sum of p/2 squared bivariate d_M if p is even and as the sum
of [p/2] squared bivariate d_M plus a squared univariate d_M if p is odd. For
example if p = 3 (considering for simplicity the variables in standardized
form), using the matrix inversion lemma it is possible to prove that the
squared Mahalanobis distance of the trivariate vector x = (x, y, z)' from 0
can be decomposed as:






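    d_M^2(x) = \frac{x^2 + y^2 - 2\rho_{xy} x y}{1 - \rho_{xy}^2}
        + \frac{\left[ z\sqrt{1 - \rho_{xy}^2} - x\,\rho_{xz.y}\sqrt{1 - \rho_{yz}^2}
          - y\,\rho_{yz.x}\sqrt{1 - \rho_{xz}^2} \right]^2}{|\Sigma|}        (1)
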

where \Sigma is the correlation matrix and \rho_{jl.k} denotes the partial
correlation coefficient between variables j and l given variable k.

The first term on the right hand side of equation (1) is nothing but the squared
marginal d_M between the first two variables. The second term can be shown
to be the squared d_M from 0 of the univariate variable z|(x,y):






    z|(x,y) = z - \frac{(\rho_{xz} - \rho_{xy}\rho_{yz})\,x + (\rho_{yz} - \rho_{xy}\rho_{xz})\,y}{1 - \rho_{xy}^2}        (2)



In the case of a univariate random variable the central region is given by
the interquartile range and the robust center is the median. Similarly to the
bivariate case, we can set equal to 1 the distance between the first (third)
quartile and the median and we can use this length for defining the other
distances. With 3 variables, therefore, we initially have to compute our
distance using the first two variables, then we have to consider the transformed
variable z|(x,y). If the ellipticity hypothesis is satisfied and the data are not
contaminated the simple sum of these two distances reduces to the global d_M.
With elliptic distributions the order in which the variables are considered
is immaterial. Finally, in the presence of asymmetric data we end up with a
metric which adapts non parametrically to the various spread of the data in
the different directions.
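The decomposition of the trivariate squared Mahalanobis distance into a bivariate term for (x, y) plus a univariate term for z given (x, y) can be verified numerically. The sketch below is only a check of the algebra for standardized variables: the correlation between the first two variables is the value -0.359 quoted earlier, while the other two correlations and the test point are hypothetical.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def mahalanobis_sq_full(x, y, z, rxy, rxz, ryz):
    """x' Sigma^{-1} x for a 3 x 3 correlation matrix, via the adjugate."""
    sigma = [[1.0, rxy, rxz], [rxy, 1.0, ryz], [rxz, ryz, 1.0]]
    adj = [
        [1 - ryz ** 2, rxz * ryz - rxy, rxy * ryz - rxz],
        [rxz * ryz - rxy, 1 - rxz ** 2, rxy * rxz - ryz],
        [rxy * ryz - rxz, rxy * rxz - ryz, 1 - rxy ** 2],
    ]
    v = [x, y, z]
    quad = sum(v[r] * adj[r][c] * v[c] for r in range(3) for c in range(3))
    return quad / det3(sigma)

def mahalanobis_sq_decomposed(x, y, z, rxy, rxz, ryz):
    """Bivariate term for (x, y) plus univariate term for z given (x, y)."""
    sigma = [[1.0, rxy, rxz], [rxy, 1.0, ryz], [rxz, ryz, 1.0]]
    bivariate = (x * x + y * y - 2 * rxy * x * y) / (1 - rxy ** 2)
    # Conditional mean and variance of z given (x, y).
    cond_mean = ((rxz - rxy * ryz) * x + (ryz - rxy * rxz) * y) / (1 - rxy ** 2)
    cond_var = det3(sigma) / (1 - rxy ** 2)
    univariate = (z - cond_mean) ** 2 / cond_var
    return bivariate + univariate

# rxy from the text; rxz, ryz and the point are hypothetical.
args = (1.0, -0.5, 2.0, -0.359, 0.3, 0.2)
full = mahalanobis_sq_full(*args)
split = mahalanobis_sq_decomposed(*args)
print(full, split)
```

The two values agree up to rounding error, which is what allows the p-variate distance to be built up from bivariate and univariate pieces.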
Abstract: This paper obtains the disparities using concepts of isotonic
regression applied to the Euclidean distance and to the nonmetric
multidimensional scaling model. The disparities are shown to be given by the
slopes of the Greatest Minorant Convex (GMC), and are calculated in an example.
This justifies their inclusion in the goodness of fit (stress), which measures
the discrepancies between the initial and fitted configuration and is used by
the procedures (ALSCAL and MDS) which fit the model.

Keywords: Nonmetric multidimensional scaling, Isotonic regression,
Disparities, Goodness of fit, Stress.



1. Disparities as slopes of segments of Greatest Minorant 
Convex 

Let I be a set of n objects and \Delta a dissimilarity matrix, \delta being
defined on \Omega. The order over \Delta induces an order over \Omega, and in
turn over I, so that the pair "(i,j) is less dissimilar than (i',j')", written
(i,j) < (i',j'), if, and only if,

    \delta(i,j) < \delta(i',j')    \forall (i,j) \in \Omega.    (1)

Given the order "<" over \Omega, an initial configuration is obtained. From it the
nonmetric Euclidean MDS model obtains the coordinates of n points in a
Euclidean space of reduced dimension p, p < n, so that the
distances between each pair of points of the configuration, d_ij, approximate as
closely as possible the dissimilarities between the pairs of objects, \delta_ij.
That is, if the distances between points of the configuration are a monotone
transformation of the dissimilarities between objects, then






( 2 ) 
This function, which changes dissimilarities into distances, is monotone and is
not fixed before fitting the model. That is, d_ij = f(\delta_ij) for all
(i,j) \in \Omega, f being a priori an unknown monotone function. In practice,
dissimilarities given on an ordinal scale cannot be perfectly transformed into
distances by a monotone transformation, so that (2) is not usually satisfied.
A measure of the degree to which this relation fails will be defined from the
disparity function, \hat{d}, which must verify

    if \delta(i,j) < \delta(i',j') then \hat{d}_ij \le \hat{d}_i'j'    \forall (i,j) \in \Omega    (3)

and differs as little as possible, in the least squares sense, from the
distances between points of the configuration.

This paper shows that the disparity function is given by the isotonic
regression, \hat{d}, of a distance function d. The isotonic regression of the
distance function is defined from basic concepts of isotonic regression
(isotonic function on X, class of isotonic functions, dual cone, Cumulative Sum
Diagram (CSD), Greatest Minorant Convex (GMC), etc.), which were formulated and
adapted to the problem of nonmetric scaling in Rivas (1989) from the work of
Barlow, Bartholomew, Bremner and Brunk (1972).

From these concepts, the two following theorems prove that a necessary and
sufficient condition for \hat{d} to be an isotonic regression of the distance d
is that \hat{d} is the function which associates each x \in X with the slope of
the segment of the GMC whose endpoint is this point.

Theorem 1. Necessary Condition 

If \hat{d} is the isotonic regression of d, then it is the slope of the GMC. That
is, it is the function which minimizes the length of all chords joining the
origin to the endpoint of the CSD, and lying below the CSD.

The length of a segment of chord joining two consecutive points and 

in the euclidean space is given as 

hj = s (4) 

The following problem must be solved 




( 5 ) 
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in the class of functions satisfying: a) F^j <D,^ for all (ij) eQ, that is, the 
chord starts from and passes through or below each point P^j ; and b) 
^in - ^n-\,n thc chord passcs through „ of CSD 

This is changed into an equivalent problem associated with the convex cone of 
a new set of fimctions/'=d- / on X. Then, (4) must be minimized in the 
class of functions {/:Jf ~^R\ d-f e ;rj), Tul being the dual cone of convex 
cone TTq and the conditions a) and b) are substituted by a’) and b’), 

a') Z Z ^0 (i,y)€Q 

q=i h=j 



b') Z Z = 0 

q=i h=J 



This minimization problem is solved for any convex function by the
proposition given in Barlow et al. (1972, p. 50). The minimization of (5) is
obtained as a direct consequence of this proposition: d and w are given real
functions on X, w being > 0 for (i,j) \in \Omega. Let \Phi be a convex function
and \phi a determination of its derivative, and let \pi_\Omega be the class of
isotonic functions with respect to the order induced by \delta over \Omega.
Then \hat{d} minimizes the length of

chords lying below CSD, Z (//+l) the class of 

(ij)eQ 

functions / so that d - f e ;Tq . 



•/. ? must be minimized. 






Given that is the class of functions / such that d - f e nl , \t must be 
shown that d satisfies 



Z (6) 

(».y)eO 



z (7) 

(U)eO 



Given that any real function p verifies 




d 

u 




(Barlow et al. 1972, p. 35), in particular substituting <!>'= P , then (6) is 
satisfied. 
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In addition (7) is satisfied. As d is isotonic and is non decreasing, then 

i,\d] belongs to a class of functions . Given / , d - f e ttI and by 
definition of dual cone the following inequality is given 

Therefore, (^) 

O.y)eo 



Theorem 2. Sufficient condition 



A solution representing the objects of I is given as (X, d). If \hat{d} is the
slope of the GMC, then \hat{d} is an isotonic regression of d. That is, any
isotonic function f on X gives






w- > id — d 






In addition, there is only one isotonic regression. 
Proof : Adding and subtracting d^j from d.j - f-j 

S [f^ij “/y) S ^ij S ~fi^ 

(ij)en (/j)en (ij)en 



The difference between the two terms is 

2,1 (8) 



By the partial summation of Abel's formula (Apostol, 1979) and the properties 
of GMC, this term is greater or equal to 0 for any isotonic function on X 
(proved by Rivas (1989, pp. 174-175)). Therefore 

2 I (rf.. -4)h^. = II [(/:. - -D,.,) 

(/,y)eO /=1 j=2 

+ S [(/+!,, + 2 “/in)~(^i+l,/+2 ~diMPin ~A>.)+(^n-l« ~ fn-\^^n-\n “ A-ln ) 

1 = 1 
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Considering /^oo = ^oo = /oo = ^oo = the last term of the above equality is 0 by 
property (1) of CMC, in that the last points of CSD and CMC are coincident. 
Thus T)„-i„=A-i»and, applying property (2), =0and Ai,> 2 -<^m= 0 , 

then 



~ ^ ^ (-^y fij-\ipij-\ Ay-l)"’" ^{fi+l,i+2 fintpin A>i 

(/,y)en /=1 j=2 /=1 



Again applying property (2) Dy_^ < D.j_^ and b^^ < D.„ , and / being isotonic, in 
that fy-fy_^>0 and yi+ 1 ,+ 2 ^0, then (8) is greater or equal to 0. 

Suppose that another isotonic regression \hat{d}' on X exists. For any isotonic
function f it must hold that

z k- -fiPij^ z k -d'iil ^ij 

(i, 7 )en (/.y)en 



By the previous theorem, given the isotonic function d , 

I k-^v) ^ % must be verified. 

(i.yjen (-j)en (<j)en' 

It is right if 

,iMi ^ ^ 

0,y)en 

but d^j > d\j contradicts d being an isotonic regression of d, then d^ = d\j. 

Based on these results, the known goodness of fit measure of the nonmetric MDS
model is defined.

The disparities (images of the disparity function) have been obtained by applying
isotonic regression to the distances between points of the configuration X that
represent the objects in a space of dimension p. From the distance function d on
X and the order '<' over \Omega, the isotonic regression of d has given
\hat{d}: X \to R^+ whose images satisfy

d{x.Xj)< d{x.^,Xj^^)i : \,...,n-2\j : 2,...,n -1; d{xiX„)< d{x^^^x^^^)i ; -2 

These values preserve a perfect monotone relation with dissimilarities and, 
simultaneously, the differences between disparities and distances are 
minimized. 
If all the distances between points are different, that is, the weight function
w is identical to 1, a global deviation from isotony is given as

    S(X, p, \delta) = \sum_{(i,j) \in \Omega} (d_ij - \hat{d}_ij)^2

This measure is known in nonmetric scaling as raw Stress (Coxon, 1982; pp.
55-58), and shows the degree to which the fitted configuration represents the
dissimilarities. But it is not invariant to changes of scale of the
configuration. To avoid this inconvenience, raw Stress is divided by a scaling
factor. The most usual scaling factors give the measures of goodness of fit
known as Stress S_1 and S_2.

2. Obtaining disparities by an example 

Column 4 gives the fitted distances d_ij obtained from symmetric dissimilarities
between pairs of 5 objects (Table 1, Column 1) (Coxon, 1982, p. 62). The order
over pairs of objects (i,j) is induced by the dissimilarities \delta_ij (Column 2).
In Column 3, the weight w_ij associated with each pair of objects is 1 (because
each pair has a different dissimilarity). Applying the above concepts, the
points P_ij = (W_ij, D_ij) of the CSD are obtained for all (i,j) \in \Omega
(Column 7).



Table 1: Points of CSD and GMC and slopes of segments of GMC

(i,j)   \delta_ij  w_ij  d_ij     W_ij  D_ij  P_ij      \hat{d}_ij  \hat{W}_ij  \hat{D}_ij  \hat{P}_ij
(5,4)       1       1     3         1     3   (1,3)       3            1          3      (1,3)
(5,1)       2       1     6(*)      2     9   (2,9)       4.5(*)       2          7.5    (2,7.5)
(2,1)       3       1     3(*)      3    12   (3,12)      4.5(*)       3         12      (3,12)
(3,2)       4       1     5         4    17   (4,17)      5            4         17      (4,17)
(4,2)       5       1     8         5    25   (5,25)      8            5         25      (5,25)
(3,1)       6       1    10         6    35   (6,35)     10            6         35      (6,35)
(4,3)       7       1    13(**)     7    48   (7,48)     11(**)        7         46      (7,46)
(5,2)       8       1    11(**)     8    59   (8,59)     11(**)        8         57      (8,57)
(5,3)       9       1     9(**)     9    68   (9,68)     11(**)        9         68      (9,68)
(4,1)      10       1    15        10    83   (10,83)    15           10         83      (10,83)



The pairs of objects (5,1), (2,1), (4,3), (5,2), (5,3), whose distances in Column 4
are marked by (*) or (**) in Table 1, do not satisfy the monotone relation (see
Figure 1). Thus, the distances must be fitted by isotonic regression.
The remaining points of the CSD satisfy \hat{W}_ij = W_ij, \hat{D}_ij = D_ij, and
thus \hat{P}_ij = P_ij. The disparities and the new coordinates of the points on
the GMC, \hat{P}_ij = (\hat{W}_ij, \hat{D}_ij), are given in Column 11.

The CSD is given in Figure 1, plotting P_ij = (W_ij, D_ij) for all
(i,j) \in \Omega. It can be seen that the GMC corrects the lack of monotonicity
at the points of the CSD which do not satisfy it.



Figure 1. Plot of CSD and GMC (axis label: Weight)



The disparities associated with the distances which do not satisfy the monotone
relation have been obtained. The slopes of the segments associated with distances
6, 3, 9, 13 and 11 lead to the minimal squared discrepancies
(d_ij - \hat{d}_ij)^2 defining the goodness of fit, the resulting measures being
S_1 = 0.122 and S_2 = 0.28.
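The worked example can be reproduced by building the CSD from the distances of Table 1, taking its greatest convex minorant, and reading the disparities off as the slopes. The sketch below is an illustration rather than the ALSCAL or MDS implementation; the normalizations used for S_1 and S_2 are assumed to be Kruskal's (raw Stress divided by the sum of squared distances, and by the sum of squared deviations of the distances from their mean, respectively).

```python
import math

def gmc_slopes(d, w=None):
    """Slopes of the greatest convex minorant of the cumulative sum diagram."""
    n = len(d)
    w = w or [1.0] * n
    # CSD points (W_k, D_k), starting from the origin.
    W, D = [0.0], [0.0]
    for dk, wk in zip(d, w):
        W.append(W[-1] + wk)
        D.append(D[-1] + wk * dk)
    # Lower convex hull of the CSD, scanned from left to right.
    hull = []
    for k in range(n + 1):
        while len(hull) >= 2:
            i, j = hull[-2], hull[-1]
            # Drop j when it lies on or above the chord from i to k.
            if (D[j] - D[i]) * (W[k] - W[i]) >= (D[k] - D[i]) * (W[j] - W[i]):
                hull.pop()
            else:
                break
        hull.append(k)
    slopes = [0.0] * n
    for i, j in zip(hull, hull[1:]):
        s = (D[j] - D[i]) / (W[j] - W[i])
        for k in range(i, j):
            slopes[k] = s
    return slopes

# Distances of Table 1, ordered by increasing dissimilarity.
d = [3, 6, 3, 5, 8, 10, 13, 11, 9, 15]
d_hat = gmc_slopes(d)

raw_stress = sum((a - b) ** 2 for a, b in zip(d, d_hat))
mean_d = sum(d) / len(d)
s1 = math.sqrt(raw_stress / sum(a ** 2 for a in d))
s2 = math.sqrt(raw_stress / sum((a - mean_d) ** 2 for a in d))
print(d_hat, raw_stress, s1, s2)
```

The slopes come out as 3, 4.5, 4.5, 5, 8, 10, 11, 11, 11, 15, matching the disparity column of Table 1, and S_1 rounds to the 0.122 quoted in the text; under the assumed normalization S_2 evaluates to roughly 0.289.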

3. Conclusions 

Contrary to other statistical models, where the theoretical model is developed
before the algorithm to fit it, in Multidimensional Scaling the algorithms
arose first, followed by the papers studying their characteristics and
properties. This paper gives statistical support to procedures based on isotonic
regression that fit this model. It should serve as a basis for further study
when a dissimilarity matrix contains equal or asymmetric data. It can also be
extended to other models such as nonmetric individual differences scaling. The
usefulness of the disparity function in preserving the order of data given on an
ordinal scale by means of a monotone function should be noted.
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Non-parametric regression models for the conjoint 
analysis of qualitative and quantitative data 
Abstract: We considered some recent non-parametric regression models to jointly
analyze qualitative and quantitative data. A comparison of the different methods
was done from a theoretical point of view, using a unified approach. The
capabilities of the methods to model a strongly non-linear simulated data-set
were presented.

Key words: non-parametric regression, non-linear methods, analysis of mixed
data, neural networks, projection pursuit regression, MARS, optimal scaling.



1. Introduction 

The aim of this paper is to analyze some recent non-parametric non-linear methods
devoted to regression analysis. We are particularly interested in analyzing the
capabilities of these methods with respect to the conjoint analysis of qualitative
and quantitative data from empirical and theoretical points of view. The methods
considered can be used to analyze a large set of variables even with a moderate
number of statistical units. No hypothesis will be made on the distribution of the
variables and the analysis will be essentially exploratory.

With a large number of variables we have the problem known as the "curse of
dimensionality", especially if we want to analyze interactions among variables. To
overcome this difficulty, all the methods use a set of low dimensional functions.

We will not consider local methods such as kernel methods, due to the difficulties
in applying these methods to high dimensional data with a not very large number
of units. All the methods analyzed, as will be shown later, can be considered to be
a special case of the general expression:

    Y = a_0 + \sum_{m=1}^{M} a_m B_m(X, \beta_m) + \varepsilon    (1)

where X is a vector of P explanatory variables.

The methods differ with respect to the definition of the set {B_m} of basis
functions; this allows a unified view of the different approaches.



2. Transforming variables inside additive models 

The first approach we consider uses the additive model:

    Y = a_0 + \sum_{i=1}^{P} a_i f_i(X_i) + \varepsilon    (2)

where a_i (i = 0, 1, ..., P) are parameters to estimate, P is the number of
predictors X_i, and f_i are non-specified transformation functions. The presence
of the parameters a_i is due to normalization constraints on the f_i. The
expression (2) can be obtained from (1) setting M = P and using univariate
transformation functions.

In the Optimal Scaling approach (Gifi 1990) with quantitative variables, the f_i
can be spline transformations, while for qualitative variables we put
f_i(X_i) = \sum_j \gamma_j \delta_{ij}, where \delta_{ij} are the dummy variables
associated with the categories, and \gamma_j are parameters to be estimated.

The analysis of ordinal qualitative data can be easily included in (2) using the 
dummy variables of order that allow the assignment of an increasing coefficient to 
the categories of the ordinal variable (see Di Ciaccio 1988). 
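The difference between ordinary dummy variables and dummy variables of order can be illustrated with a small sketch. The coding below is an assumption made for illustration (cumulative indicators, one per step up the ordered scale); the exact definition in Di Ciaccio (1988) may differ, and the category labels and coefficients are hypothetical.

```python
def nominal_dummies(category, levels):
    """One indicator per category (usual coding for nominal variables)."""
    return [1 if category == lev else 0 for lev in levels]

def order_dummies(category, levels):
    """Cumulative coding: the k-th dummy is 1 when the category is at or
    above the k-th step of the ordered scale (k = 1, ..., len(levels) - 1)."""
    rank = levels.index(category)
    return [1 if rank >= k else 0 for k in range(1, len(levels))]

def score(dummies, coefs):
    """Quantification of a category: linear combination of its dummies."""
    return sum(c * d for c, d in zip(coefs, dummies))

# Hypothetical ordinal variable and non-negative step coefficients.
levels = ["low", "medium", "high"]
step_coefs = [0.5, 0.25]

codes = [order_dummies(lev, levels) for lev in levels]
scores = [score(code, step_coefs) for code in codes]
print(codes, scores)
```

Because each coefficient multiplies a step that stays switched on for all higher categories, non-negative coefficients give quantifications that never decrease along the scale, which is the property needed for ordinal variables.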

Hastie & Tibshirani (1990) proposed a similar approach, but in a probabilistic
framework, known as the Generalized Additive Model (GAM). Setting
\eta = E(Y | X_1, X_2, ..., X_P), GAM can be written as:

    l(\eta) = a_0 + \sum_{j=1}^{P} a_j f_j(X_j)    (3)

where l is a given link function and the parameters a_j can be absorbed by the
functions f_j if these are not constrained. These functions are estimated by
maximum likelihood using a backfitting algorithm, while the Optimal Scaling
approach uses an alternating least squares algorithm.

The additive model approaches can include interactions among variables only by
explicitly introducing interaction terms like f_12(X_1, X_2), usually in the simple
form f_12(X_1 · X_2). This can be done in a hierarchical approach, which allows the
distinction between the contribution of main effects and interactions. Modelling
interactions with these approaches is usually quite difficult.



3. Projection methods 

Some techniques, rather than considering transformations of the original data, use
transformations of optimal projections of the data on low dimensional subspaces
(usually of dimension \le 2). The best-known method is the Projection Pursuit
Regression model (PPR) (Friedman & Stuetzle 1981), which, given one dependent
variable Y, can be formulated as:

    Y = a_0 + \sum_{m=1}^{M} a_m f_m\left( \sum_{i=1}^{P} \gamma_{im} X_i \right) + \varepsilon    (4)




where f_m are normalized one-dimensional smooth functions of the one dimensional
projection and a_0 = E(Y | X = x). Obviously, when the number of explanatory
variables P is high, the PPR model is more parsimonious than the additive
model, which is additive in its predictor effects. Furthermore PPR can deal
with interactions of predictors, but the interpretation is usually difficult. A
smooth multiple additive regression technique, SMART, was proposed as a
generalization of (4) for K dependent variables (Friedman 1985). The analysis of
dichotomous or polychotomous explanatory variables can be included in (4) by
means of dummy variables. The analysis of ordinal qualitative data is not
considered in the literature but it can be easily included in PPR using the
dummy variables of order. Using this approach we have to restrict the search for
the optimal directions \gamma to a region where some elements \gamma_im are
non-negative.



4. Neural Network 

A well-known form of neural network is the feed forward neural network (NN) 
where nodes are organized in different layers, one for input nodes (the explanatory 
variables), one or more intermediate layers for hidden nodes and one layer for 
output nodes (the dependent variables). Furthermore, the only permissible 
connections are between nodes in consecutive layers and directed upwards. The 
simplest architecture of a feed forward neural network has only one single hidden 
layer and one output node : 



    Y = g\left( a_0 + \sum_{m=1}^{M} a_m f_m\left( \gamma_{0m} + \sum_{i=1}^{P} \gamma_{im} X_i \right) \right) + \varepsilon    (5)



where $X_i$ is the i-th explanatory variable, Y is the dependent variable,
$\gamma_{im}$ denotes the weight of the connection between the i-th explanatory
variable and the m-th hidden node, and $\alpha_m$ the weight between the m-th
hidden node and the output node. Expression (5) is very general; in fact, by
introducing specific constraints we can obtain different non-parametric models
(see Cheng, Titterington 1994) such as PPR (when g is the identity function and
the $f_m$ are not specified) or GAM (M = P, $\gamma_{0m} = 0$, $\gamma_{im}$ is
the Kronecker $\delta_{im}$ and g is the identity function). However, while in
PPR and GAM models we consider M different functions $f_m$ determined from the
data, in neural networks the $f_m$ are M identical functions fixed by the user
(generally chosen to be monotone, e.g. sigmoidal functions or the hyperbolic
tangent) and g is usually a linear, sigmoidal or threshold function.
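The noise-free part of expression (5) can be sketched in a few lines. The function name `single_hidden_layer` and the numerical values are illustrative assumptions; the example only shows that the GAM-type constraints quoted above (M = P, zero hidden biases, Kronecker weights, identity g) reduce the network to a sum of univariate terms.

```python
import numpy as np

def single_hidden_layer(X, alpha0, alpha, gamma0, gamma, f=np.tanh, g=lambda u: u):
    """g(alpha0 + sum_m alpha_m * f(gamma0_m + sum_i gamma_im * X_i)) for each row of X."""
    hidden = f(gamma0 + X @ gamma)      # shape (n, M); gamma has shape (P, M)
    return g(alpha0 + hidden @ alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
P = X.shape[1]
alpha = np.array([1.0, -2.0, 0.5])

# GAM-type constraints: M = P, gamma0 = 0, gamma = identity (Kronecker delta), g = identity.
additive = single_hidden_layer(X, 0.3, alpha, np.zeros(P), np.eye(P))
```

Here every hidden node sees exactly one input, so the output is additive in the predictors; in a genuine GAM the univariate functions would be estimated from the data rather than fixed to `tanh`.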

Another difference between NN and the projection methods or the methods based on
recursive partitioning is that for NN all of the parameters are estimated
simultaneously by optimizing some measure of fit to the training data. The
quality of the fit depends on the right choice of several elements: architecture
(number of nodes and layers), learning (learning rate and momentum coefficient)
and algorithm (generalized delta rule, conjugate gradients, quasi-Newton
algorithms etc.).

If we analyze data sets including qualitative variables, it is useful to
distinguish between nominal and ordinal scales. In the first case, one
dichotomous input can correspond to each class; in the second case we might
consider the dummy variables of order as described in the previous section. In
this case we should impose non-negativity constraints on the weights
corresponding to the input dummy variables. The connection between NN and PPR
is not only formal, because under mild assumptions they share the same
properties of convergence and approximation to continuous functions on compact
subsets of $\mathbb{R}^P$ (Cybenko, 1989; Jones, 1992). However, when the number
of input variables is high and the variables are highly correlated, PPR turns
out to be more parsimonious than NN in terms of parameters (Ripley, 1994a).

Feed forward NN are also widely used in discriminant analysis (Ripley 1996,
Schumacher et al. 1996). In this case the output variables are dummy indicators of
classes, and a logistic function can be chosen as g so that the output nodes take
values in [0,1]. The number of weights in a NN is very large, so we often obtain
weakly identifiable solutions.



5. Methods based on recursive partition regression 

Recursive Partition Regression (RPR) is a powerful method (Breiman, Friedman,
Olshen, Stone 1984). This approach is very useful when there is a significant
interaction structure in the predictors. It is based on a binary decision tree in
which the nodes split the data into two groups that are as homogeneous as possible
with respect to the response. The corresponding CART algorithm grows a very large
tree and then prunes it back to a reasonable size by cross-validation. We can
write the RPR as:

$$ Y = \sum_{m=1}^{M} \alpha_m I(X \in R_m) + \varepsilon \qquad (6) $$

where I(.) is the indicator function, the $R_m$ are disjoint regions of the
covariate space and the coefficients $\alpha_m$ are estimated by the means of Y
in $R_m$. The fundamental limit of recursive partitioning models is the lack of
continuity: the model produced is piecewise constant and sharply discontinuous at
sub-region boundaries.
Multivariate Adaptive Regression Splines (MARS) (Friedman 1991) can be seen
as a generalization of the RPR. To span the space of univariate q-order splines we
can use the basis (one-sided truncated power basis functions)
$\{X^j\}_{j=0}^{q}$, $\{(X - t_k)_+^q\}_{k=1}^{K}$, where the $t_k$ are the knot
locations, K indicates the number of knots and the subscript + indicates a value
of zero for negative values of the argument. In particular, MARS uses a truncated
two-sided power basis with q=1 to construct the multivariate spline basis
functions

$$ B_m(X) = \prod_{k=1}^{K_m} \big[ s_{km} (X_{v(k,m)} - t_{km}) \big]_+ \qquad (7) $$

where v(k,m) labels the explanatory variable used in the k-th factor of the
m-th product; $s_{km}$ takes on a value of +1 or -1 and indicates the (right/left)
sense of the associated function. With this definition of basis functions, the
MARS model can be written as in (1). Because basis elements are products of a
finite number of univariate linear splines, this approach produces a piecewise
linear function which does not possess continuous derivatives. In the final step,
MARS uses the previous model to derive an analogous piecewise-cubic basis with
continuous first derivatives. The algorithm is a stepwise procedure with an
automatic selection of basis functions. Neural networks, in which
back-propagation optimizes simultaneously over all basis elements, take the
opposite approach.
RPR is invariant with respect to monotone transformations of explanatory variables.
This relevant property simplifies the analysis of ordinal qualitative variables, which
can be inserted into the analysis by simply assigning a progressive value to the
categories. We can adopt the same approach with MARS.
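The product basis functions in (7) are easy to evaluate directly. The sketch below is only an illustration of the formula with q = 1; the helper name `mars_basis` and the (variable, sign, knot) triples are made up for the example, and no knot selection is performed.

```python
import numpy as np

def mars_basis(X, factors):
    """B_m(X) = prod_k [s_km * (X[:, v] - t_km)]_+ for factors given as (v, s, t) triples."""
    B = np.ones(X.shape[0])
    for v, s, t in factors:
        B *= np.maximum(s * (X[:, v] - t), 0.0)   # truncated linear (hinge) factor
    return B

X = np.array([[0.2, 0.9],
              [0.6, 0.1],
              [0.8, 0.7]])

# Interaction term: right-sided hinge on variable 0 at 0.5, left-sided hinge on variable 1 at 0.8.
B = mars_basis(X, [(0, +1, 0.5), (1, -1, 0.8)])
```

The first observation lies to the left of the knot on variable 0, so its basis value is zero; the function is piecewise linear in each variable and continuous, but its derivative jumps at the knots.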

Rearranging the terms of the model we can write MARS as the more interpretable 
expression: 

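$$ Y = \alpha_0 + \sum_{K_m=1} f_i(X_i) + \sum_{K_m=2} f_{ij}(X_i, X_j) + \sum_{K_m=3} f_{ijk}(X_i, X_j, X_k) + \dots \qquad (8) $$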

MARS was subsequently extended to analyze qualitative nominal data, considering
its relation with Recursive Partitioning Regression (RPR) (Friedman 1993).

Let $A_l$ be a subset of the modalities of a nominal variable $X_j$; the basis
functions of $X_j$ then take the form $I(X_j \in A_l)$. The MARS model with
ordinal and nominal variables becomes (with q=1):

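$$ Y = \alpha_0 + \sum_{m=1}^{M} \alpha_m \prod_{l=1}^{L_m} I(X_{v(l,m)} \in A_{lm}) \prod_{k=1}^{K_m} \big[ s_{km} (X_{v(k,m)} - t_{km}) \big]_+ + \varepsilon \qquad (9) $$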



6. Comparison on noisy artificial data 

In this section, an application of the methods to an artificial data-set is presented. In 
this way we can easily evaluate the fit to the underlying generation model. We use 4 
variables: 2 quantitative variables X and Y, 1 categorical variable V with 6 
modalities {a,b,c,d,e,f} and one variable Z defined as 

Z = X + sin(2πXY)              if V = a or V = c

Z = X + log(Y) + cos(πXY)      if V = b or V = d     (10)

Z = X + log(Y) - 4XY           if V = e or V = f

The variables X, Y and V were independently generated from uniform distributions; 
X and Y assume values in the range (0,1). 

Note that, given the category of V, we can represent Z = f(X,Y|V) in a
3-dimensional space. The first graph of figure 1, the first graph of figure 2 and
figure 3 give the 3 different graphical representations of the function
Z = f(X,Y|V). We then extracted a sample of 100 observations and added a Normal
error giving a signal-to-noise ratio of 3/1.
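A data set of this kind can be simulated along the following lines. This is a sketch under stated assumptions rather than the authors' generator: formula (10) is used as it can be read from the text (with no exponent on the cosine term), the six categories of V are taken to be equally likely, and the signal-to-noise ratio of 3/1 is interpreted as a ratio of standard deviations. The names `true_z` and `simulate` are hypothetical.

```python
import numpy as np

def true_z(x, y, v):
    """Formula (10), evaluated element-wise for arrays x, y and category labels v."""
    z = np.empty_like(x)
    ac = np.isin(v, ["a", "c"])
    bd = np.isin(v, ["b", "d"])
    ef = np.isin(v, ["e", "f"])
    z[ac] = x[ac] + np.sin(2 * np.pi * x[ac] * y[ac])
    z[bd] = x[bd] + np.log(y[bd]) + np.cos(np.pi * x[bd] * y[bd])
    z[ef] = x[ef] + np.log(y[ef]) - 4 * x[ef] * y[ef]
    return z

def simulate(n, snr=3.0, seed=0):
    """Draw X, Y ~ U(0, 1) and V uniformly on {a, ..., f}, then perturb Z with Normal noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n)
    y = rng.uniform(0.0, 1.0, n)
    v = rng.choice(list("abcdef"), size=n)
    signal = true_z(x, y, v)
    # Assumption: "signal/noise = 3/1" is read as a ratio of standard deviations.
    noise = rng.normal(0.0, signal.std() / snr, n)
    return x, y, v, signal + noise

x, y, v, z = simulate(100)
```

With 100 observations and six equally likely categories, each category receives roughly 15 to 18 points, which matches the sparsity discussed in the text.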

The artificial data show a strong interaction among all the explanatory
variables, although this is a small problem with only 3 explanatory (mixed)
variables, which can be examined graphically conditional on the category of V. On
the other hand, we have few perturbed observations (15-18) for each category of V
and it is very difficult to discover the true underlying functions by means of a
simple graphical analysis. The function f(X,Y,V) considered in the artificial
data is quite "smooth". This is a requirement in a non-parametric approach: the
only way a non-parametric method can distinguish between signal and noise is to
assume smoothness of the signal. The function f(X,Y,V) should vary smoothly while
the noise should not. A non-parametric method is effective if it can approximate
the smooth true function well while filtering out the rough noise.

[Figure 3: perspective plot, panel title Z = X + sin(6.28*X*Y)]

[Figure 4: perspective plot]

We applied Projection Pursuit, Neural Networks and MARS, which are able
to model interactions among variables, to these data. The application of the
feedforward NN required a search for an appropriate architecture to connect the 8
input nodes (two input nodes for X, Y and six dichotomous inputs for the
categories of V) to the output node (perturbed Z). We found an optimal solution
with a fully connected network with 3 hidden layers having 5, 4, 3 nodes
respectively, i.e. 88 parameters. The optimal solution for PPR was obtained for
M=3, while the optimal solution of MARS includes 6 basis functions and the
interactions (X,V), (Y,V), (X,Y,V). In Figures 1 and 2 we show, with perspective
plots, the true f(X,Y|V=b) and f(X,Y|V=e) with the corresponding estimated
functions for each method. Examining these graphs we note a better
performance of MARS on our data-set. The other two methods are equivalent and
less effective than MARS, although the simpler and more parsimonious PPR
model should be preferred to NN.
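The parameter count quoted for the network can be checked with a short calculation, assuming a fully connected architecture with one bias term per hidden and output node (the bias convention is an assumption, but it reproduces the figure of 88 given above). The function name `count_parameters` is illustrative.

```python
def count_parameters(layer_sizes):
    """Weights plus one bias per non-input node in a fully connected feed-forward network."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 8 inputs, hidden layers of 5, 4 and 3 nodes, 1 output node.
small = count_parameters([8, 5, 4, 3, 1])

# 8 inputs, hidden layers of 10, 8 and 4 nodes, 1 output node.
large = count_parameters([8, 10, 8, 4, 1])
```

The first architecture gives (8+1)·5 + (5+1)·4 + (4+1)·3 + (3+1)·1 = 88 parameters; the second, larger architecture with 10, 8 and 4 hidden nodes gives 219 under the same convention.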

We also considered a larger data-set with 1000 observations to evaluate the
performance of the methods on a larger sample. Applying NN to these data, we used
a more complex feed forward NN with 10 nodes for the first hidden layer, 8 nodes
for the second and 4 for the third (219 parameters). With this larger sample, NN
increases its capability to identify the true underlying function. The fit of
Z = f(X,Y|V=e) represented in Figure 4 shows a clear improvement, while the
increased sample size is not so useful with the other two methods.



7. Conclusion 

Every non-parametric method we have analyzed has some weaknesses and some
strengths, which we observe theoretically and find again empirically when
analyzing simulated data. NN represents an interesting and general class of
non-parametric methods, but it is difficult to interpret and requires much
training data and some experience to avoid local minima. MARS is a very powerful
method, particularly suitable for analyzing interactions between variables.
However, for highly complex non-additive data with very large samples it can be
outperformed by NN. This method can also be more advantageous than MARS in the
presence of outliers or a high degree of multicollinearity in the predictors,
always with a very large data-set. PPR is the simplest and most interpretable
model if the variables have high correlations and we obtain M<3. The Optimal
Scaling methods and GAM are usually outperformed by the previous methods on
complex data-sets, but when the variables have low interaction levels these
methods can be usefully interpretable.
Abstract: We present two methods to explain linearly a set of variables Y by
another set X through a neighbourhood relationship matrix, with non-independent
observations. We establish a link between these two methods and the regression
framework with autocorrelated errors.

Keywords: Local analysis; Multivariate analysis; Partial Least Squares; Principal
Component Analysis with respect to Instrumental Variables.



1 Introduction 

Explaining one or more dependent response variables ($Y_1, Y_2, \dots, Y_q$) by
several variables $X_1, X_2, \dots, X_p$ is common in fields as varied as biology
and the behavioural or social sciences. Numerous articles and books are devoted
to both theoretical (e.g. Rao and Toutenburg, 1995) and applied (e.g. Draper and
Smith, 1981 or Haining, 1990) aspects of this vast subject. For example,
multivariate linear models attempt to solve this sort of problem using linear
combinations of the explanatory variables to explain Y. From this linear point of
view it is possible to derive numerous methods such as Partial Least Squares
(PLS) regression (e.g. Wold, 1985) or Principal Component Analysis with respect
to Instrumental Variables (PCAIV). An extensive bibliography of the latter can be
found in van der Burg and de Leeuw (1990), including the historical paper of Rao
(1964). We focus on the non-independent case of these two methods. Although the
classical framework enables us to tackle non-independence through non-diagonal
variance matrices of the errors, we introduce a different approach to modelling
non-independence between observations. In the following we shall write n for the
number of these observations.



2 Euclidean framework

Following Lebart (1969) we introduce a matrix M such that the element $M_{ij}$
reflects the neighbouring relationship between observation i and observation
j. These coefficients are chosen by the user to belong to [0, 1]. For instance, if
data are measured at different depths (in a lake, say) we can use a linear
boolean neighbourhood matrix constructed as shown below:

Figure 1: Linear neighbourhood (legend: neighbour / non neighbour)

Other more complex structures of this matrix M can be found in Cliff and Ord
(1981); for instance, the coefficients $M_{ij}$ could be a function of the
distance between observations.

We adopt in the present paper the geometrical approach using metrics to compute 
distances between objects and variables, see for instance Escoufier (1987). 

Let Y be the n × q data matrix, Q a q × q symmetric positive definite matrix
defining a scalar product on the space of dependent variables, and
$D = \mathrm{diag}(p_1, p_2, \dots, p_n)$ an n-dimensional diagonal matrix with
positive elements summing to 1. Recall that the empirical total variance of the
k-th dependent variable can be written as:

$$ v(Y_{.k}) = \frac{1}{2} \sum_{i,j} p_i p_j (Y_{ik} - Y_{jk})^2. $$

If we are just interested in the local structure of variations, we can introduce 
the neighbourhood structure defining the local variance (Lebart, 1969, Banet and 
Lebart, 1984) which is closely related to the Geary index (Geary, 1954): 

$$ v_L(Y_{.k}) = \frac{1}{2} \sum_{i,j} p_i p_j M_{ij} (Y_{ik} - Y_{jk})^2. $$

From this we can also write the local covariance as: 

$$ \mathrm{cov}_L(Y_{.k}, Y_{.l}) = \frac{1}{2} \sum_{i,j} p_i p_j M_{ij} (Y_{ik} - Y_{jk}) (Y_{il} - Y_{jl}). $$
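The local variance and covariance can be computed directly from their definitions. The sketch below builds a linear boolean neighbourhood matrix (adjacent observations are neighbours) and checks numerically that, for a symmetric M, the local variance equals the quadratic form y'D(D* - MD)y with D* = diag(p*_i) and p*_i = Σ_j M_ij p_j, which is the metric that reappears in the following sections. Function names and the random data are illustrative assumptions.

```python
import numpy as np

def linear_neighbourhood(n):
    """Boolean matrix with M[i, j] = 1 when observations i and j are adjacent in a line."""
    M = np.zeros((n, n))
    idx = np.arange(n - 1)
    M[idx, idx + 1] = 1.0
    M[idx + 1, idx] = 1.0
    return M

def local_cov(yk, yl, p, M):
    """cov_L = 1/2 * sum_ij p_i p_j M_ij (yk_i - yk_j) (yl_i - yl_j)."""
    dk = yk[:, None] - yk[None, :]
    dl = yl[:, None] - yl[None, :]
    return 0.5 * np.sum(np.outer(p, p) * M * dk * dl)

n = 6
rng = np.random.default_rng(1)
y = rng.normal(size=n)
p = np.full(n, 1.0 / n)          # uniform weights summing to 1
M = linear_neighbourhood(n)

D = np.diag(p)
D_star = np.diag(M @ p)          # p*_i = sum_j M_ij p_j
quadratic = y @ D @ (D_star - M @ D) @ y
local_var = local_cov(y, y, p, M)
```

Setting every entry of M to 1 recovers the ordinary (total) weighted variance, so the local variance can be read as the part of the variance carried by neighbouring pairs.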

Within this framework we shall redefine PLS regression and PCAIV from a local
point of view in the following sections. Finally we shall establish a
correspondence between this approach and a regression one using autocorrelated
errors.



3 Local PLS regression

Partial least squares regression is a method of analysis useful for designs with 
more explanatory variables than samples. It can easily deal with multicollinearity 
in the matrix X. The aim is to predict the response by a model based on linear 
transformations of the explanatory variables, and to construct a regression model 
of the following type: 

where m is the dimension of the model,

$\hat{Y}_j$ is the prediction of the j-th variable of Y.

The latent variables $Z_1, \dots, Z_m$ are linear combinations of the explanatory
variables $X_1, \dots, X_p$. They are chosen such that they are uncorrelated with
each other. Because we are only interested in local variations, we search for
latent variables that are locally uncorrelated with each other. For completeness
we can also define another scalar product in the vector space $\mathbb{R}^p$ of
observations of the independent variables by a positive definite matrix N. We can
write the local PLS regression procedure as follows:

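Table 1: Local PLS model of dimension m

$Y^{(1)} \Leftarrow Y$ and $X^{(1)} \Leftarrow X$

for k = 1 to m

    $Z_k \Leftarrow X^{(k)} u^{(k)}$ with

    $X^{(k)\prime} D(D^* - MD)\, Y^{(k)} Q\, Y^{(k)\prime} D(D^* - MD)\, X^{(k)} N u^{(k)} = \lambda u^{(k)}$,

    where $D^* = \mathrm{diag}(p_i^*)$ and $p_i^* = \sum_{j=1}^{n} M_{ij} p_j$,

    and $\lambda$ is the first (largest) eigenvalue of the operator.

    $X^{(k+1)} \Leftarrow (I_n - Z_k (Z_k' D(D^* - MD) Z_k)^{-1} Z_k')\, X^{(k)}$

    $Y^{(k+1)} \Leftarrow (I_n - Z_k (Z_k' D(D^* - MD) Z_k)^{-1} Z_k')\, Y^{(k)}$

end for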



It can easily be seen that local PLS regression is just a special case of PLS
regression with a new scalar product in the space $\mathbb{R}^n$ of variables. The
matrix defining this scalar product is written as $D(D^* - MD)$ and takes into
account the neighbourhood relationship between the observations.

The dimension m of the model can be checked by using a cross-validation
procedure. One observation (one line, for instance the i-th line) is taken out
from X and Y. The model at stage k is then fitted to this reduced data set. The
squared difference between the predicted value according to model dimension k,
$\hat{Y}^{(k)}_{(-i)j}$, and the actual value of the j-th variable of Y is added to
PRESS(k). The total sum of squares of predictions minus observations for every
deleted line is a measure of predictive power:

$$ \mathrm{PRESS}_{(k)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{q} \big( Y_{ij} - \hat{Y}^{(k)}_{(-i)j} \big)^2 $$









The size m of the model for the j-th variable is then chosen to achieve the
minimum of $\mathrm{PRESS}_{(k)j}$.

We can apply this method to the Californian data set (Upton and Fingleton, 1985,
p.273). These data relate to islands and to mainland localities along the
Californian coast. The variables measured were: Y the number of species, $X_1$
the area in square miles, $X_2$ the maximum elevation in feet and $X_3$ the
latitude (degrees North). An ordinary PLS of size 1 re-expressed as an ordinary
regression leads to the following model:

$$ \hat{Y}^{(1)} = \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3 X_3. $$

The parameter estimates are: 

Table 2: parameter estimates of PLS model fitted to the Californian data

α1      α2      α3

0.379   0.495   0.374



We can use the 'two-part distance-based' weight matrix M defined in Upton and
Fingleton (p.291) as the squared inverse of the distance between two locations up
to a cut-off distance (taken to be 187.5 miles). Beyond this limit weights are
taken to be zero. These weights are then normalized so that each line sums
to 1. A Geary test can be performed on the residuals of the PLS regression. It
shows that they are not independent (p-value < 0.01), and a local PLS regression
can be used to model this fact. The fitted model becomes:

Table 3: parameter estimates of local PLS model fitted to the Californian data

α1      α2      α3

0.434   0.484   0.065



We can see that the influence of latitude is far less important. The residuals
take into account the location dependency and the fitted model no longer
considers the latitude as a good explanatory variable. A look at the variations of
PRESS confirms our selection of a one-dimensional model for this example.
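The construction of the distance-based weight matrix used in this example (inverse squared distance up to a cut-off, zero beyond it, each line normalized to sum to 1) can be sketched as follows. The coordinates are hypothetical planar locations in miles, not the Californian data, and the use of Euclidean distance is an assumption; a location with no neighbour within the cut-off is given a line of zeros.

```python
import numpy as np

def two_part_weights(coords, cutoff):
    """Inverse squared distance up to `cutoff`, zero beyond, rows normalized to sum to 1."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    W = np.zeros_like(dist)
    near = (dist > 0) & (dist <= cutoff)          # exclude self-pairs and distant pairs
    W[near] = 1.0 / dist[near] ** 2
    row_sums = W.sum(axis=1, keepdims=True)
    # Rows without any neighbour stay at zero instead of being divided by zero.
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

# Hypothetical locations along a line, in miles.
coords = np.array([[0.0, 0.0], [100.0, 0.0], [150.0, 0.0], [400.0, 0.0]])
M = two_part_weights(coords, cutoff=187.5)
```

Note that the normalized matrix is in general not symmetric, since each line is divided by its own sum.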



4 Local PCAIV 

Local PCAIV is another way to explain the local variation of the dependent
variables $Y_1, \dots, Y_q$ by a linear transformation of the explanatory
variables $X_1, \dots, X_p$. We first have to recall the expression of the local
inertia $I_L$ as a natural extension of the local variance (1):

$$ I_L = \frac{1}{2} \sum_{i,j} p_i p_j M_{ij} \, \| Y_{i.} - Y_{j.} \|_Q^2 $$

where $Y_{i.} \in \mathbb{R}^q$ is the i-th row of the matrix Y.
We can define local PCAIV as finding the linear transformation H which minimizes
the local inertia of the discrepancy between Y and its explanation by a linear
transformation:

$$ \min_H \; \frac{1}{2} \sum_{i,j} p_i p_j M_{ij} \, \| (Y_{i.} - H X_{i.}) - (Y_{j.} - H X_{j.}) \|_Q^2 $$

From this definition it can be found that:

$$ H = Y' D E X (X' D E X)^{-1}, $$

where $E = (D^* - MD)$,

and $D^* = \mathrm{diag}(p_i^*)$ with $p_i^* = \sum_{j=1}^{n} M_{ij} p_j$.
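A minimal numerical sketch of this closed form is given below, assuming a symmetric neighbourhood matrix and uniform weights. The function name `local_pcaiv_map` and the simulated data are illustrative; a pseudo-inverse is used in place of the ordinary inverse so that the code also runs when X'DEX is singular.

```python
import numpy as np

def local_pcaiv_map(X, Y, p, M):
    """H = Y' D E X (X' D E X)^+ with E = D* - M D and p*_i = sum_j M_ij p_j."""
    D = np.diag(p)
    E = np.diag(M @ p) - M @ D
    A = X.T @ D @ E @ X
    return Y.T @ D @ E @ X @ np.linalg.pinv(A)

rng = np.random.default_rng(2)
n, n_x, n_y = 30, 3, 2
X = rng.normal(size=(n, n_x))
H_true = rng.normal(size=(n_y, n_x))
Y = X @ H_true.T                     # exact linear relation: row i of Y is H_true applied to row i of X

# Linear boolean neighbourhood: adjacent observations are neighbours.
M = np.zeros((n, n))
idx = np.arange(n - 1)
M[idx, idx + 1] = 1.0
M[idx + 1, idx] = 1.0

H = local_pcaiv_map(X, Y, np.full(n, 1.0 / n), M)
```

When Y is an exact linear function of X, the local discrepancy can be made zero and the formula recovers the generating transformation, which gives a quick sanity check of an implementation.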

This is a particular case of PCAIV with a new scalar product in the space
$\mathbb{R}^n$ of variables, defined by DE as in the local PLS regression.
Although this method is a linear explanation of Y by X for the local variations,
the model obtained is obviously different from local PLS regression. Local PCAIV
can be defined as searching for an optimal metric in the space $\mathbb{R}^p$ of
observations of the latent variables. We can write this metric N as

$$ N = (X' D E X)^{-1} X' D E Y Q Y' D E X (X' D E X)^{-1}, $$

and of course an a priori choice of the metric in $\mathbb{R}^p$, as in local PLS
regression, is not possible.

Because of the projection on the space spanned by the columns of X,
multicollinearity within X must be dealt with by using a Moore-Penrose inverse
instead of the ordinary inverse. In this case, a better predictive power is
achieved by PLS regression. We will illustrate this method using a data set from
ecology.



5 Local multivariate analysis and autoregressive errors 

To take into account the non-independence of the observations in a regression
framework, it is sometimes useful to assume that the error term is autocorrelated.
Several classical models of autocorrelation exist, and for a complete description
of them we refer to Cliff and Ord (1981). If a conditional autoregressive model
(CAR) is chosen, then the variance of the error can be written as:

$$ \mathrm{Var}(\varepsilon) = \sigma^2 (I_n - \rho W)^{-1} $$

where $(\sigma, \rho)$ are the parameters of the CAR error model,

W contains the known coefficients of autoregression.

The least squares estimator is then the projection of Y on the vector space
spanned by the columns of X with respect to the metric
$\mathrm{Var}(\varepsilon)^{-1}$. It can easily be found that, if the weights of
the observations $\{p_i\}$ are chosen equal to $1/n$ and $M = \rho W$, then using
local multivariate methods or a CAR error term in a regression framework is
equivalent in the sense that both use the same metric. This particular metric can
also be found in local Principal Component Analysis or in local Correspondence
Analysis (Thioulouse, Chessel and Champely, 1995). All these analyses build a
local multivariate data analysis framework, following the paper of Lebart (1969).
Abstract: This paper presents an overview of the main non-parametric
approaches in regression and multidimensional data analysis. Most of the
techniques are introduced in compact form, just to give the essential ideas
that are relevant to non-parametric methods.

Key words: Non-parametric Regression, Non linear Principal Component
Analysis, Optimal Spline transformation.



1. Introduction 

In the literature (Stone 1985) the basic aspects of a statistical model are
flexibility, dimensionality and interpretability. Flexibility is the capacity
of the model to give unbiased estimates, which is extremely helpful in
preliminary and exploratory data analysis. Dimensionality is strictly related to
the variance of the estimates; in particular, the "curse of dimensionality" means
that exponentially increasing numbers of data points are required to populate
spaces of increasing dimension. Interpretability is the capacity of the model to
reveal the structure of the data. Parametric models often do not meet all of
these requirements.

The term nonparametric refers to the flexible functional form of the
regression curve (Härdle 1989). In non-parametric regression models, the
experimenter looks for flexibility of the regression function in fitting
the data. In contrast, under a parametric model only one possible family of
curves is chosen and the information in the data is restricted to the assumed
parametric form. It should be noted that, even though parametric and
non-parametric models represent distinct approaches to regression analysis,
non-parametric techniques can be used to assess the validity of parametric ones.
In low dimensional settings, global parametric modeling has been successfully
generalised using paradigms such as piecewise and local parametric fitting and
roughness penalty methods (the most popular are based on splines). In higher
dimensional settings, it is not feasible to estimate a large number of
parameters simultaneously while fixing them all externally to the procedure, and
only non-parametric regression techniques are able to give flexible function
estimates, providing a versatile pre-screening method for outliers.

In multidimensional data analysis (MDA), non-parametric procedures have
found acceptance with the introduction of non linear versions of the more
familiar linear techniques. First, the Gifi System (Gifi 1990) was introduced,
based on the idea of optimising an objective loss function by means of
Alternating Least Squares (ALS) algorithms. The least squares estimates give the
parameters of one subset while holding the others constant. In this complex data
analysis system the authors pay attention to the optimal estimates of the model
parameters. Subsequently in the literature more attention has been devoted to
the optimal estimates of the non linear transformation parameters of the original
variables (non linear Principal Component Analysis, van Rijckevorsel 1987,
van Rijckevorsel et al. 1988, van Rijckevorsel et al. 1993 and Tessitore et al.
1996) and of the principal components (Adaptive Principal Surfaces, LeBlanc et
al. 1994, Principal Surface Constrained Analysis, et al 1997a,b,c).



2. Non-parametric Regression Models 

In this section we provide a brief description of the most popular approaches
to the non-parametric regression problem (such as kernel, smoothing spline and
least-squares spline estimators) and an overview of several promising new
non-parametric estimation techniques for the analysis of high dimensional data.

A non-parametric linear estimator of a simple regression curve has the form

$$ \hat{E}(y \mid x) = \sum_{i=1}^{n} K(x, x_i; \lambda)\, y_i, $$

where $K(\cdot, x_i; \lambda)$ is a set of weight functions which depend on
$\lambda$. A simple method of choosing the weights is to use a symmetric unimodal
probability density function K defined on [-1, 1] and, for some real number
$\lambda > 0$ ($\lambda$, called the bandwidth, governs the amount of averaging
and controls the peakedness of the function), to estimate E(y|x) by kernel
estimators (Dirichlet kernel form of classical Fourier series estimators, Eubank
1988):

$$ \hat{E}(y \mid x) = \sum_{i=1}^{n} K\Big(\frac{x - x_i}{\lambda}\Big) y_i \Big/ \sum_{i=1}^{n} K\Big(\frac{x - x_i}{\lambda}\Big). $$

Unfortunately, the lack of data at the boundaries causes a boundary bias problem
for kernel estimators.
In contrast, spline functions (regression or smoothing splines) are attractive
for their flexibility. A spline function is a linear combination of a
suitable set of piecewise functions (usually basis splines, or B-splines, Eubank
1988) defined on each interval; a non-negative combination of integrals of
B-splines defines monotone splines, denoted I-splines. A spline knot sequence
(number and location) partitions the domain of a continuous variable into a
limited number of adjoining intervals. The dimension of a spline function depends
on the degree and the number of interior knots. For smoothing splines, the
regression function estimator, which belongs to the Sobolev space $W_2^m[0,1]$,
is

$$ \arg\min_f \; \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int_0^1 \big(f^{(m)}(x)\big)^2 dx, \qquad \lambda > 0, $$

where the $x_i$ are in [0,1]; the solution is unique and belongs to the space of
univariate natural splines. Knots are located at the distinct data points and the
parameter $\lambda$, selected through visual inspection of the data, controls the
trade-off between the fit to the data and the smoothness of the estimator.
Smoothing splines differ from regression splines mainly in the way they are used.
For regression or least-squares splines, the tuning parameter $\lambda$ (the knot
sequence) in the univariate setting is estimated using least squares.
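The kernel estimator described earlier in this section, and its boundary bias, can be illustrated with a short sketch. The Epanechnikov density is used here as one example of a symmetric unimodal density on [-1, 1]; this choice, the function names and the test data are assumptions made for the illustration.

```python
import numpy as np

def epanechnikov(u):
    """Symmetric unimodal density supported on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def kernel_estimate(x0, x, y, bandwidth, kernel=epanechnikov):
    """Weighted average sum_i K((x0 - x_i)/lambda) y_i / sum_i K((x0 - x_i)/lambda)."""
    w = kernel((np.atleast_1d(x0)[:, None] - x[None, :]) / bandwidth)
    return (w @ y) / w.sum(axis=1)

# Noise-free linear data on an equally spaced design.
x = np.linspace(0.0, 1.0, 101)
y = 2.0 * x + 1.0

# Evaluate in the interior (x = 0.5) and at the left boundary (x = 0).
fit = kernel_estimate(np.array([0.5, 0.0]), x, y, bandwidth=0.1)
```

In the interior the symmetric window reproduces the line, while at x = 0 only points to the right contribute, so the estimate is pulled above the true value of 1; this is the boundary bias mentioned in the text.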

Let $y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$, $x_i \in \mathbb{R}^p$;
$i = 1, \dots, n$ be the standard linear regression model. The estimation of this
model in a non-parametric setting could be accomplished by using Laplacian
smoothing splines or multivariate kernel estimators, but problems arise when p is
large: the variance increases with the dimensionality (curse of dimensionality)
and multivariate estimators lack interpretability. An alternative to multivariate
smoothers is to consider the following popular generalisations:

i) The mean of y depends on the explanatory variables in an additive way
(Breiman 1993). This interpretation allows the substitution of $x_j$ with p simple
functions of $x_j$ estimated using univariate estimators (circumventing the curse
of dimensionality).

ii) The mean of y depends on one linear combination of the predictors. The
mean changes along one direction in the space $\mathbb{R}^p$ and is constant in
the orthogonal directions.

Under interpretation i), the relevant models are the Generalised Additive Models
(GAM, Hastie et al. 1986):

$$ g(\mu(x)) = \alpha + \sum_{j=1}^{p} f_j(x_j) $$

where $\mu(x) = E(y \mid x)$ and the function $g(\cdot)$, assumed known, is called
the link function. Estimation of the $f_j$ in GAM can be based on the backfitting
algorithm of Friedman et al. (1981), which is a Gauss-Seidel type procedure and
applies a univariate smoother at each step. It forms the "inner loop" of the
Alternating Conditional Expectation (ACE, Breiman et al. 1985) algorithm. ACE is
a non-parametric generalization of the additive model that fits an additive model
as part of an alternating estimation procedure. In ACE the response variable y
and the predictor variables are replaced by $\theta(y)$ and
$\phi_1(x_1), \dots, \phi_p(x_p)$.

The model is

$$ E[\theta(y_i)] = \sum_{j=1}^{p} \phi_j(x_{ij}). $$

ACE provides a versatile method for estimating the maximal correlation between
two variables, because the functions $\theta$ and $(\phi_j)_{1 \le j \le p}$ are
estimated by maximising the squared correlation coefficient.
The disadvantage of this method (Ramsay 1988) is that when the explanatory
variables are correlated the algorithm becomes very sensitive to their order;
furthermore, it seems more suitable for correlation than for regression analysis.
This method has been proposed to overcome the limit of the GAM, namely its
inability to separate the additive from the interactive effects of the
predictors. If the smoother refers to global univariate least squares fits, then
backfitting converges to the multivariate least squares solution. Furthermore we
recall the alternating variance estimation algorithm (AVAS, Tibshirani 1988),
which improves ACE using an asymptotic variance stabilization criterion. Another
approach suggested by Stone (1985) for estimating GAM is to use least-squares
splines.
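The statement that backfitting with univariate least-squares fits converges to the multivariate least squares solution can be checked numerically. The sketch below uses a straight-line least-squares fit as the "smoother" for each predictor and cycles over the predictors in Gauss-Seidel fashion; the data, the number of sweeps and the function name `backfit_linear` are assumptions made for the illustration.

```python
import numpy as np

def backfit_linear(X, y, n_iter=200):
    """Backfitting with a univariate least-squares line as the smoother for each predictor."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # centre predictors so each f_j has mean zero
    alpha = y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):                       # Gauss-Seidel sweep over the predictors
            # Partial residual: remove every fitted term except the j-th one.
            partial = y - alpha - Xc @ beta + Xc[:, j] * beta[j]
            beta[j] = Xc[:, j] @ partial / (Xc[:, j] @ Xc[:, j])
    return alpha, beta

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.5 * X[:, 0]                         # mildly correlated predictors
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

alpha, beta = backfit_linear(X, y)

# Reference: ordinary multiple least squares with an intercept column.
ols = np.linalg.lstsq(np.column_stack([np.ones(200), X]), y, rcond=None)[0]
```

Replacing the line fit by a genuine univariate smoother (kernel or spline) turns the same loop into the backfitting step of an additive model; stronger correlation between predictors slows the convergence of the sweeps.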

Under interpretation ii), an interesting generalisation is Projection
Pursuit Regression (PPR, Friedman et al. 1981, Buja et al. 1989, Huber 1985),
expressed in the model:

$$ E(y \mid x) = \sum_{k=1}^{K} f_k(a_k' x), $$

where x is a p-dimensional variable, the $a_k$ are direction vectors onto which
the data are projected, the number K of projections is a parameter of the
procedure and each $f_k$ is an arbitrary one-dimensional function. The GAM is a
special case of PPR in which there are exactly p directions fixed at the
coordinate directions (it is less general than PPR but more easily
interpretable). Friedman et al. (1983) and Ramsay (1988) propose splines to
approximate the functions in PPR. Smooth Multiple Additive Regression
Techniques (SMART) represent an extension of PPR to the case of several
response variables. Non-parametric function estimation based on univariate
functions, as outlined above, is an important step forward and (especially for
additive modeling) has met with practical success in addressing the curse of
dimensionality. Regrettably this approach has some limitations as a
general method for estimating a large number of functions. The further step
has therefore been the development of strategies for high dimensional
approximation based on adaptive computation. Adaptive algorithms for function
approximation have been developed using two paradigms, recursive
partitioning (Breiman et al. 1984) and projection pursuit selection (Friedman
1981, Friedman 1983). A flexible regression modeling of high dimensional data
was proposed by Friedman (Multivariate Adaptive Regression Splines, MARS,
1991). The model takes the form of an expansion in product spline basis
functions, where the parameters associated with the basis are automatically
determined by a recursive partitioning approach. The result is satisfactory with
respect to modeling both additive and interactive relationships.



3. Non-linear Multidimensional Data Analysis 

Principal Component Analysis (PCA) can be generalised in different
directions to yield Non Linear Principal Component Analysis (NL-PCA, van
Rijckevorsel 1987, van Rijckevorsel et al. 1988). The incorporation of non
linear transformations in PCA has quite a long history and can be found, among
others, in Gifi (1990). In the Gifi system, Homogeneity Analysis (HO), also known
as Multiple Correspondence Analysis, is a key method.

This analysis is performed by the minimization of a specific loss function.
Depending on the restrictions imposed on the HO loss, all kinds of analysis may
be performed, and they are carried out by algorithms of the Alternating Least
Squares (ALS) type. In each ALS iteration two steps are alternated: in the first
step the optimum basis (corresponding to the scores for the individuals or
objects) is computed for given values of the transformations (category
quantifications or transformation coefficients); in the second step new values
for the optimum transformations are calculated for the given basis, previously
computed. Alternating these steps produces a decreasing sequence of loss
function values. Since the transformation is not determined a priori, the
procedure itself can be considered non-parametric. We briefly introduce NL-PCA.
Let X be the (n,m) object scores matrix, and $Y_j$ and $G_j$ respectively the
$(k_j, m)$ weighting matrix and the $(n, k_j)$ coding matrix (j = 1, ..., p), with
p variables, n objects, m components, and $k_j$ basis functions. The loss
function in NL-PCA (Gifi 1990, van Rijckevorsel 1987) is:

$$\sigma(X; Y) = \sum_{j=1}^{p} \mathrm{tr}\,(X - G_j Y_j)'(X - G_j Y_j).$$

This equation measures the departure of the transformed variables $G_j Y_j$
($j = 1, \ldots, p$) from a hypothetical $X$, a synthesis and linear combination
of the original variables, often called the “components”. It is minimised with
respect to the $X$ and $Y_j$ matrices by means of an ALS algorithm. The
importance of non-parametric MDA became apparent once more attention was paid
to the parameters of the non linear transformations, in particular the
parameters of the transformation splines (the degree and the knot sequence).
The optimal knot location can be regarded as a choice of the model (parameters
set externally, for example a uniform knot placement) or as a free parameter to
be optimised (it can influence the fit of a subsequent model).
Splines have been used to transform the original variables in NL-PCA (fuzzy
coding: van Rijckevorsel et al. 1988, van Rijckevorsel 1987; I-splines: Besse
et al. 1995, Ferraty 1997), in PCA on Instrumental Variables (Durand 1993), in
Non Linear Non Symmetric Correspondence Analysis (Verde 1992), in Discriminant
Analysis (Hastie et al. 1994), in the distance approach to MDA, Principal
Distance Analysis, whose objective falls within the Multidimensional Scaling
domain (Meulman 1986, 1992), and to transform the components in Adaptive
Principal Surfaces (LeBlanc et al. 1994) and in Constrained PCA (Lombardo et
al. 1997a,b,c). In order to get an optimal transformation of the original
variables that is coherent with both the purpose of the analysis and the data
structure, van Rijckevorsel et al. (1993) propose an adaptive procedure for the
non-parametric detection of the optimal knot sequence, using the recursive
partitioning approach. Furthermore, to find a good compromise between the
quality of the function approximation of the original variables and the
over-fitting problems, both tied to an increasing number of parameters,
Tessitore et al. (1996) propose an algorithm to detect the optimal number of
knots using the Generalised Cross Validation criterion (Craven et al. 1979).
Lauro et al. (1996) extend the non-parametric detection of the optimal knot
location to multivariate coding through multivariate B-splines.

In the context of non linear PCA by means of spline smoothers (smoothing
splines, hybrid splines, monotone smoothing splines and monotone hybrid
splines), Ferraty (1997) proposes smooth optimal transformations of the
original variables which maximise the explained variance of the transformed
data. A bootstrap procedure adapted to the model leads to an optimal choice of
the smoothing parameter. Durand (1993) proposes Additive Spline PCA on
Instrumental Variables (ASPCAIV), which combines features of multi-response
additive spline regression analysis and PCA via a spline transformation of the
predictor matrix. Dimension reduction occurs after a linear smoothing procedure
has been applied. The associated smoother matrix is the projection matrix,
which is to be compared with that of the least squares splines. Although
projections are processed on spaces of smaller dimension, ASPCAIV is
nevertheless sensitive to scarcity of data and to multicollinearity.

Hastie et al. (1994) propose non-parametric versions of Discriminant Analysis
(DA). Linear DA is a tool for multigroup classification; it can be seen as
equivalent to multi-response linear regression using optimal scorings to
represent the groups. Replacing linear regression by any non-parametric
regression method yields different non-parametric versions. In this way any
multi-response regression technique (such as MARS) can be post-processed to
improve its classification performance.

A further interesting development has been the extension of non-parametric
adaptive procedures to the principal components. The main aim of these recent
techniques is the optimal transformation of the principal components in order
to represent objects taking into account the interactions among variables, by
computing multivariate splines.
While the transformed original variables are still considered univariate, the
non linear components are now considered both as single and as interactive
variables. LeBlanc et al. (1994) develop a non linear generalization of PCA for
dimensionality reduction. A Principal Surface (PS) of the data is constructed
adaptively using the MARS procedure. PS are multivariate splines obtained as
tensor products of univariate transformation splines of the single components
by means of the following formula:

$$f(c) = \sum_{m=1}^{M} \beta_m \prod_{k=1}^{K_m} b_{km}\bigl(c_{v(k,m)}\bigr),$$

where the $\beta_m$ are the coefficients for the construction of the spline
transformation; $M$ is the number of basis functions of the multivariate
spline; $K_m$ is the number of univariate spline bases used to construct each
tensor; $b_{km}$ is the basis function of the generic component $c_{v(k,m)}$ on
the $m$-th partition of the multivariate domain, used in the $k$-th term of the
product. In order to detect the optimal number of components used to construct
the PS (space dimension problem) the GCV criterion is considered.
In the framework of constrained PCA, Lombardo et al. (1997b) call their
analysis Principal Surfaces Constrained Analysis (PSCA): it relaxes the
linearity assumption on the principal components, allowing the data to be
represented on PS optimally computed by an adaptive procedure. They look for
the optimal PS representation by optimising the knot location of the
multivariate transformation splines through a modified version of the recursive
partitioning procedure (Friedman 1991). As the non linearity now concerns the
components and no longer the original variables, the knot detector optimises
the following criterion:
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“ /(^7 ))^ ]• The vector \j (the column of X) and each^Cy) is a 

non linear parameterised function mapping into R. The resulting 
“multidimensional” representation makes it possible to investigate in depth the
similarities among objects due to a particular variable (classification
variable). Furthermore, in order to avoid over-fitting problems related to the
number of transformation parameters, Lombardo et al. (1997c) propose an
algorithm for detecting the optimal number of knots. Finally, to complete this
overview, it is worth recalling Additive Spline Partial Least Squares (ASPLS,
Durand & Sabatier 1997). ASPLS regression consists in carrying out a PCA of the
predictor matrix, using spline transformations of the principal components,
which play the role of explanatory variables, allowing an additive
approximation to the response variables. In conclusion, this paper provides a
brief summary of selected topics concerning non-parametric procedures in
regression and multidimensional data analysis, in order to sketch the part of
the theory that opens new developments and perspectives.
Summary: The center $\mu$ of a multivariate data set $\{x_1, \ldots, x_n\}$ can
be defined as the minimizer of the norm of a vector of distances. In this paper,
we consider a vector of $L_1$-distances
$y_1' = (\|x_1 - \mu\|_1, \ldots, \|x_n - \mu\|_1)$ and a vector of
$L_2$-distances $y_2' = (\|x_1 - \mu\|_2, \ldots, \|x_n - \mu\|_2)$. We define
the $L_1$-median and the $L_1$-mean as the minimizers of respectively the
$L_1$- and the $L_2$-norm of $y_1$, and then the $L_2$-median and the
$L_2$-mean as the minimizers of respectively the $L_1$- and the $L_2$-norm of
$y_2$. In doing so, we obtain four alternatives to describe the center of a
multivariate data set. While three of them have already been investigated in
the statistical literature, the $L_1$-mean appears to be a new concept.
Contrary to the $L_1$-median, the $L_1$-mean is proved to be unique in almost
all situations.

Key words: Mean, Median, Measure of centrality, Multivariate data, $L_1$-norm,
$L_2$-norm, Location.



1. Introduction 

The median and the mean are two well known criteria used to describe the center
of a univariate data set. Let a data set $x = (x_1, \ldots, x_n)'$ be a vector
of $\mathbb{R}^n$. The median of $x$ can be defined as the point $\mu$ that
minimizes the function

$$\sum_{i=1}^{n} |x_i - \mu| \qquad (1)$$

and the mean of $x$ as the point $\mu$ that minimizes the function

$$\sum_{i=1}^{n} (x_i - \mu)^2. \qquad (2)$$

The median and the mean of $x$ may also be defined as the minimizers of
respectively the $L_1$-norm and the $L_2$-norm of the vector of distances

$$y = (|x_1 - \mu|, \ldots, |x_n - \mu|)'$$

since $\|y\|_1$ is the same as (1), whereas $\|y\|_2$ is the square root of (2).
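The two univariate characterizations can be checked numerically. The following minimal sketch, on a made-up sample of odd size, scans a grid of candidate centers and compares the grid minimizers of criteria (1) and (2) with the usual median and mean; the data, the grid and the function names are assumptions of the example.

```python
import statistics


def sum_abs(data, mu):
    """Criterion (1): L1-norm of the vector of distances |x_i - mu|."""
    return sum(abs(x - mu) for x in data)


def sum_sq(data, mu):
    """Criterion (2): squared L2-norm of the same vector of distances."""
    return sum((x - mu) ** 2 for x in data)


data = [1.0, 2.0, 4.0, 7.0, 16.0]        # made-up sample, odd size
med = statistics.median(data)
avg = statistics.mean(data)

# Candidate centers 0.0, 0.1, ..., 20.0; both 4.0 and 6.0 lie on this grid.
grid = [k / 10 for k in range(0, 201)]
best_abs = min(grid, key=lambda mu: sum_abs(data, mu))
best_sq = min(grid, key=lambda mu: sum_sq(data, mu))
```

On this sample the grid minimizer of (1) coincides with the median and the grid minimizer of (2) with the mean, even though the largest observation pulls the two centers apart.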

A natural way to extend the definition of the mean from one to several
dimensions is to define the mean of a multivariate data set as the vector of
coordinate means, as found in all multivariate statistical textbooks. While
this is not problematic, a similar extension for the median is not
straightforward. The vector of coordinate medians is in fact only one of
several possibilities proposed in the statistical literature to define the
median of a multivariate data set. Small (1990) surveys several possible
definitions of the multivariate median, such as those proposed by Tukey (1975),
Oja (1983) and Liu (1990). These have the common property of producing the
usual definition of the median in the univariate case, but give different kinds
of medians in multivariate situations.

In this paper, we reconsider the problem of defining the median and the mean
for data sets in dimension higher than one. Our point of departure is to define
the multivariate median and the multivariate mean as minimizers of respectively
the $L_1$-norm and the $L_2$-norm of a vector of distances, as done in the
univariate case. There are however several kinds of distances in multivariate
spaces. In Section 2 we consider a vector of $L_1$-distances as well as a
vector of $L_2$-distances. In doing so we obtain two types of multivariate
medians and two types of multivariate means, called $L_1$-median, $L_2$-median,
$L_1$-mean and $L_2$-mean. It turns out that the $L_1$-median is the vector of
coordinate medians and the $L_2$-mean the vector of coordinate means. The
$L_2$-median and the $L_1$-mean are less well-known. But while the $L_2$-median
has already been investigated in the statistical literature, the $L_1$-mean
appears to be a new concept. Thus the $L_1$-mean turns out to be a surprising
alternative to the $L_2$-mean, which is usually presented as the only
possibility of generalizing the univariate mean. The question of the uniqueness
of the $L_1$-mean is discussed in Section 3. Contrary to the $L_1$-median, the
$L_1$-mean is proved to be unique in almost all situations.

2. $L_1$- and $L_2$-Medians and Means

Let a data set $X = (x_1, \ldots, x_n)'$ be a matrix of
$\mathbb{R}^{n \times p}$, i.e. each data point $x_i = (x_{i1}, \ldots, x_{ip})$
is an element of $\mathbb{R}^p$. In order to be able to define the median or the
mean $\mu = (\mu_1, \ldots, \mu_p)$ of the multivariate data set $X$ in a
similar way as for the univariate case, we first have to choose a vector of
distances and then to minimize respectively the $L_1$- and the $L_2$-norm of
this vector. In this paper we propose to choose either the vector of
$L_1$-distances

$$y_1 = y_1(\mu) = \begin{pmatrix} \|x_1 - \mu\|_1 \\ \vdots \\ \|x_n - \mu\|_1 \end{pmatrix}$$

or the vector of $L_2$-distances

$$y_2 = y_2(\mu) = \begin{pmatrix} \|x_1 - \mu\|_2 \\ \vdots \\ \|x_n - \mu\|_2 \end{pmatrix}.$$

The $L_1$-median is then defined as the minimizer of $\|y_1\|_1$, the
$L_1$-mean as the minimizer of $\|y_1\|_2$ (or equivalently of $\|y_1\|_2^2$),
the $L_2$-median as the minimizer of $\|y_2\|_1$ and the $L_2$-mean as the
minimizer of $\|y_2\|_2$ (or equivalently of $\|y_2\|_2^2$).
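Thus the $L_1$-median of $X$ is the point that minimizes

$$\|y_1\|_1 = \sum_{i=1}^{n} \|x_i - \mu\|_1 = \sum_{i=1}^{n} \sum_{j=1}^{p} |x_{ij} - \mu_j|. \qquad (3)$$

The $L_1$-mean of $X$ is the point that minimizes

$$\|y_1\|_2^2 = \sum_{i=1}^{n} \|x_i - \mu\|_1^2 = \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{p} |x_{ij} - \mu_j| \Bigr)^2. \qquad (4)$$

The $L_2$-median of $X$ is the point that minimizes

$$\|y_2\|_1 = \sum_{i=1}^{n} \|x_i - \mu\|_2 = \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{p} (x_{ij} - \mu_j)^2 \Bigr)^{1/2} \qquad (5)$$

and the $L_2$-mean of $X$ is the point that minimizes

$$\|y_2\|_2^2 = \sum_{i=1}^{n} \|x_i - \mu\|_2^2 = \sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - \mu_j)^2. \qquad (6)$$

We have therefore two ways to define the median and two ways to define the mean
of a multivariate data set.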
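To make the four criteria concrete, the sketch below writes (3) to (6) as plain functions of a candidate center and evaluates them on a small made-up bivariate data set. The function names and the data are assumptions of the example; only the two centers with a closed form (coordinate medians and coordinate means) are computed.

```python
import math
import statistics


def l1_dist(x, mu):
    """L1-distance between a data point and a candidate center."""
    return sum(abs(a - b) for a, b in zip(x, mu))


def l2_dist(x, mu):
    """L2-distance between a data point and a candidate center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mu)))


def crit_l1_median(X, mu):      # criterion (3)
    return sum(l1_dist(x, mu) for x in X)


def crit_l1_mean(X, mu):        # criterion (4)
    return sum(l1_dist(x, mu) ** 2 for x in X)


def crit_l2_median(X, mu):      # criterion (5)
    return sum(l2_dist(x, mu) for x in X)


def crit_l2_mean(X, mu):        # criterion (6)
    return sum(l2_dist(x, mu) ** 2 for x in X)


# Made-up data set with n = 5 points in the plane.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 5.0)]
coord_median = tuple(statistics.median(col) for col in zip(*X))
coord_mean = tuple(statistics.mean(col) for col in zip(*X))
```

On these data the vector of coordinate medians gives a smaller value of (3) than the vector of coordinate means, and the reverse holds for (6), in line with the identification of the $L_1$-median and the $L_2$-mean stated in the Introduction.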

Let us return to the function (3) minimized by the $L_1$-median. As

$$\sum_{i=1}^{n} \sum_{j=1}^{p} |x_{ij} - \mu_j| = \sum_{j=1}^{p} \sum_{i=1}^{n} |x_{ij} - \mu_j|,$$

the $j$-th coordinate of the $L_1$-median of $X$ is the point $\mu_j$ that
minimizes the function $\sum_{i=1}^{n} |x_{ij} - \mu_j|$ and hence is the median
of the univariate data set $x_j = (x_{1j}, \ldots, x_{nj})'$. As the same
argument is valid for each coordinate, it follows that the $L_1$-median is
simply the vector of coordinate medians. This type of median was first
introduced by Hayford (1902). It is also studied in Mood (1941) and in Haldane
(1948) where it is called the arithmetic median. The $L_1$-median of a data set
need not be unique when the size $n$ of the data set is even, as is the case for
the univariate median.

The identification of the $L_2$-mean is straightforward too. The function (6)
that has to be minimized can be rewritten as follows:

$$\sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - \mu_j)^2 = \sum_{j=1}^{p} \sum_{i=1}^{n} (x_{ij} - \mu_j)^2.$$

Thus the $j$-th coordinate of the $L_2$-mean of $X$ is the point $\mu_j$ that
minimizes the function $\sum_{i=1}^{n} (x_{ij} - \mu_j)^2$ and hence is the mean
of the univariate data set $x_j = (x_{1j}, \ldots, x_{nj})'$. As the same
argument is valid for each coordinate, it follows that the $L_2$-mean is simply
the vector of coordinate means. The $L_2$-mean is a very old and natural
concept. All multivariate books are based on the $L_2$-mean. The $L_2$-mean of
a data set is always unique.

The $L_2$-median is also an already known concept. It was first introduced by
Gini and Galvani (1929) and by Eells (1930) and was called the spatial median
by Brown (1983), although Haldane (1948) introduced the term geometrical
median, and Gower (1974) the term mediancentre. In his article, which treats all
kinds of multivariate medians, Small (1990) unfortunately introduces the term
$L_1$-median to denote the spatial median. This last appellation brings some
confusion since, as mentioned above, the spatial median minimizes a sum of
$L_2$-distances! The vector of coordinate medians, which minimizes a sum of
$L_1$-distances, should be called the $L_1$-median. The $L_2$-median of a data
set is unique except when the data points lie in a one dimensional subspace,
where the $L_2$-median reduces to the usual median in this subspace (see e.g.
Milasevic and Ducharme, 1987).
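Unlike the coordinate medians and means, the $L_2$-median has no closed form. A common way to compute it numerically, not discussed in this paper, is the Weiszfeld fixed-point iteration; the sketch below is a minimal version of it under simple assumptions (start at the coordinate means, skip a data point if the iterate lands on it). The function names, the tolerance and the data are made up for the example.

```python
import math


def spatial_median(points, iterations=1000, tol=1e-12):
    """Weiszfeld iteration for the minimizer of the sum of L2-distances."""
    dim = len(points[0])
    # Start from the vector of coordinate means.
    mu = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    for _ in range(iterations):
        weights, num = 0.0, [0.0] * dim
        for p in points:
            d = math.dist(p, mu)
            if d < 1e-15:       # iterate sits on a data point: skip it
                continue
            weights += 1.0 / d
            for j in range(dim):
                num[j] += p[j] / d
        if weights == 0.0:
            break
        new_mu = [v / weights for v in num]
        converged = math.dist(new_mu, mu) < tol
        mu = new_mu
        if converged:
            break
    return tuple(mu)


def sum_l2(points, mu):
    """Criterion (5): sum of the L2-distances to the candidate center."""
    return sum(math.dist(p, mu) for p in points)


# Four corners of a square plus one outlying point (made-up data).
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (10.0, 2.0)]
center = spatial_median(points)
mean = (3.6, 2.0)               # vector of coordinate means of these points
```

On these data the outlying point pulls the coordinate mean to $x = 3.6$, while the spatial median stays closer to the center of the square.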

Let us now differentiate the function (4) minimized by the $L_1$-mean. We
obtain as partial derivative (so long as $x_{ij} \neq \mu_j$ for each
$i = 1, \ldots, n$)

$$\frac{\partial \|y_1\|_2^2}{\partial \mu_j} = -2 \sum_{i=1}^{n} \mathrm{sgn}(x_{ij} - \mu_j) \cdot \|x_i - \mu\|_1$$

where the function $\mathrm{sgn}(x)$ is defined to take the value $+1$ if
$x > 0$ and $-1$ if $x < 0$. Thus if $\mu$ is an $L_1$-mean of $X$, we have (for
each $j = 1, \ldots, p$) that the sum of the $L_1$-norms of the vectors
$x_i - \mu$ for which $x_{ij} < \mu_j$ is equal to the sum of the $L_1$-norms of
the vectors $x_i - \mu$ for which $x_{ij} > \mu_j$, that is

$$\sum_{\{i : x_{ij} < \mu_j\}} \|x_i - \mu\|_1 = \sum_{\{i : x_{ij} > \mu_j\}} \|x_i - \mu\|_1. \qquad (7)$$

To our knowledge the $L_1$-mean has never been studied.
3. Uniqueness of the $L_1$-mean

The function (4) minimized by the $L_1$-mean is a continuous real valued convex
function defined on $\mathbb{R}^p$, which tends to infinity in all directions of
$\mathbb{R}^p$. It follows from the theory of convexity (see e.g. Rockafellar,
1972, p.265) that the minimum of (4) is attained on a non empty closed bounded
convex set. This result ensures the existence of the $L_1$-mean but not its
uniqueness.

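To see the convexity of (4), let us consider two points
$\mu = (\mu_1, \ldots, \mu_p)$ and $\nu = (\nu_1, \ldots, \nu_p)$ and a real
number $\lambda$ such that $0 < \lambda < 1$. We have

$$\|y_1((1-\lambda)\mu + \lambda\nu)\|_2^2 = \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{p} |x_{ij} - (1-\lambda)\mu_j - \lambda\nu_j| \Bigr)^2$$
$$= \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{p} |(1-\lambda)(x_{ij} - \mu_j) + \lambda(x_{ij} - \nu_j)| \Bigr)^2$$
$$\le \sum_{i=1}^{n} \Bigl( (1-\lambda) \sum_{j=1}^{p} |x_{ij} - \mu_j| + \lambda \sum_{j=1}^{p} |x_{ij} - \nu_j| \Bigr)^2$$

with the equality if and only if

$$\forall i = 1, \ldots, n \;\; \forall j = 1, \ldots, p: \quad (x_{ij} \le \mu_j \text{ and } x_{ij} \le \nu_j) \text{ or } (x_{ij} \ge \mu_j \text{ and } x_{ij} \ge \nu_j). \qquad (8)$$

As we have ($a$ and $b$ being two positive real numbers)

$$(1-\lambda) a^2 + \lambda b^2 - [(1-\lambda) a + \lambda b]^2 = \lambda (1-\lambda) (a - b)^2 \ge 0$$

with the equality if and only if $a = b$, we conclude that

$$\sum_{i=1}^{n} \Bigl( (1-\lambda) \sum_{j=1}^{p} |x_{ij} - \mu_j| + \lambda \sum_{j=1}^{p} |x_{ij} - \nu_j| \Bigr)^2 \le (1-\lambda) \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{p} |x_{ij} - \mu_j| \Bigr)^2 + \lambda \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{p} |x_{ij} - \nu_j| \Bigr)^2$$

with the equality if and only if

$$\forall i = 1, \ldots, n: \quad \sum_{j=1}^{p} |x_{ij} - \mu_j| = \sum_{j=1}^{p} |x_{ij} - \nu_j|. \qquad (9)$$

Putting together these two inequalities we obtain the convexity of (4), that is

$$\|y_1((1-\lambda)\mu + \lambda\nu)\|_2^2 \le (1-\lambda) \cdot \|y_1(\mu)\|_2^2 + \lambda \cdot \|y_1(\nu)\|_2^2 \qquad (10)$$

with the equality if and only if (8) and (9) hold.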

Now if $\mu$ and $\nu$ are two distinct minima of (4), that is, two distinct
$L_1$-means of the data set $X$, it is clear that we have the equality in (10).
Thus the conditions for the non uniqueness of the $L_1$-mean depend precisely
on (8) and (9). The following theorem specifies these conditions for bivariate
data sets.

Theorem 3.1. Let $\mu = (\mu_1, \mu_2)$ be an $L_1$-mean of a data set
$X = (x_1, \ldots, x_n)'$ for which each data point $x_i = (x_{i1}, x_{i2})$ is
an element of $\mathbb{R}^2$.

(a) If for each $i = 1, \ldots, n$ we have

$$(x_{i1} \ge \mu_1 \text{ and } x_{i2} \ge \mu_2) \text{ or } (x_{i1} \le \mu_1 \text{ and } x_{i2} \le \mu_2),$$

that is if one part of the data is on the top right of $\mu$ and the other part
of the data is on the bottom left of $\mu$, then the $L_1$-means of $X$ are
points of the form $(\mu_1 + \varepsilon, \mu_2 - \varepsilon)$ with
$\varepsilon \in \mathbb{R}$. The $L_1$-means of $X$ are hence found on a line
segment with slope $-1$.

(b) If for each $i = 1, \ldots, n$ we have

$$(x_{i1} \le \mu_1 \text{ and } x_{i2} \ge \mu_2) \text{ or } (x_{i1} \ge \mu_1 \text{ and } x_{i2} \le \mu_2),$$

that is if one part of the data is on the top left of $\mu$ and the other part
of the data is on the bottom right of $\mu$, then the $L_1$-means of $X$ are
points of the form $(\mu_1 + \varepsilon, \mu_2 + \varepsilon)$ with
$\varepsilon \in \mathbb{R}$. The $L_1$-means of $X$ are hence found on a line
segment with slope $+1$.

(c) If we are not in one of the situations (a) or (b) then the $L_1$-mean of
$X$ is unique.

Proof. Let $\nu$ be another $L_1$-mean of $X$ such that
$\nu = (\mu_1 + \varepsilon, \mu_2 + \delta)$ with
$\varepsilon, \delta \in \mathbb{R}$ and let $\gamma_i$ be the difference
$\|x_i - \nu\|_1 - \|x_i - \mu\|_1$ for $i = 1, \ldots, n$. We have

$$\|y_1(\nu)\|_2^2 = \sum_{i=1}^{n} \|x_i - \nu\|_1^2 = \sum_{i=1}^{n} \bigl( \|x_i - \mu\|_1 + \gamma_i \bigr)^2 = \|y_1(\mu)\|_2^2 + 2 \sum_{i=1}^{n} \gamma_i \|x_i - \mu\|_1 + \sum_{i=1}^{n} \gamma_i^2. \qquad (11)$$

Figure 3.1: The $L_1$-means of these 18 data points are on the line segment
marked with triangles.

Then partition the data points $x_i$ as follows: let $A_1$, $A_2$, $A_3$ and
$A_4$ be the sets of data points which are respectively on the top right, on
the top left, on the bottom left and on the bottom right of $\mu$. Note that by
(8) $A_1$, $A_2$, $A_3$ and $A_4$ are also the sets of data points which are
respectively on the top right, on the top left, on the bottom left and on the
bottom right of $\nu$. We have therefore
$\gamma_i = -\varepsilon - \delta$ for the $n_1$ points in $A_1$,
$\gamma_i = \varepsilon - \delta$ for the $n_2$ points in $A_2$,
$\gamma_i = \varepsilon + \delta$ for the $n_3$ points in $A_3$ and
$\gamma_i = -\varepsilon + \delta$ for the $n_4$ points in $A_4$. It follows
that the second term of the right part of (11) is

$$2\varepsilon \Bigl[ \sum_{\{i : x_i \in A_2 \cup A_3\}} \|x_i - \mu\|_1 - \sum_{\{i : x_i \in A_1 \cup A_4\}} \|x_i - \mu\|_1 \Bigr] + 2\delta \Bigl[ \sum_{\{i : x_i \in A_3 \cup A_4\}} \|x_i - \mu\|_1 - \sum_{\{i : x_i \in A_1 \cup A_2\}} \|x_i - \mu\|_1 \Bigr]$$

where each $[\ldots]$ is equal to zero by the property (7) of the $L_1$-mean
given in Section 2. Thus, if $\nu$ is an $L_1$-mean of $X$, the third term of
the right part of (11)

$$\sum_{i=1}^{n} \gamma_i^2 = (n_1 + n_3) \cdot (\varepsilon + \delta)^2 + (n_2 + n_4) \cdot (\varepsilon - \delta)^2$$

must be equal to zero in order to have
$\|y_1(\nu)\|_2^2 = \|y_1(\mu)\|_2^2$. If $n_2 = n_4 = 0$, we must then have
$\delta = -\varepsilon$, that is
$\nu = (\mu_1 + \varepsilon, \mu_2 - \varepsilon)$, which is the case (a) of the
theorem. If $n_1 = n_3 = 0$, we must then have $\delta = \varepsilon$, that is
$\nu = (\mu_1 + \varepsilon, \mu_2 + \varepsilon)$, which is the case (b) of the
theorem. In any other case we must have simultaneously
$\delta = -\varepsilon$ and $\delta = \varepsilon$, that is
$\delta = \varepsilon = 0$ and $\nu = \mu$, which is the case (c) of the
theorem. □

The consequence of Theorem 3.1 is that the $L_1$-mean of a bivariate data set
is unique except for very particular cases. Figure 3.1 illustrates situation
(b) of Theorem 3.1.
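Situation (a) of Theorem 3.1 can be checked by hand on the smallest possible example, which is not taken from the paper: the two points $(0, 0)$ and $(1, 1)$. For any $\mu$ in the unit square, criterion (4) equals $a^2 + (2 - a)^2$ with $a = \mu_1 + \mu_2$, which is minimal for $a = 1$, so every point of the segment from $(0, 1)$ to $(1, 0)$ is an $L_1$-mean. The sketch below evaluates the criterion on and off this segment; the data and the names are assumptions of the example.

```python
def l1_mean_criterion(X, mu):
    """Criterion (4): sum over data points of the squared L1-distance to mu."""
    return sum(sum(abs(a - b) for a, b in zip(x, mu)) ** 2 for x in X)


# Made-up data: one point on the bottom left, one on the top right.
X = [(0.0, 0.0), (1.0, 1.0)]

# Candidate centers on the segment mu_1 + mu_2 = 1 (slope -1) ...
on_segment = [(0.5, 0.5), (0.25, 0.75), (0.75, 0.25), (1.0, 0.0)]
# ... and a few candidates off that segment.
off_segment = [(0.6, 0.6), (0.4, 0.4), (0.5, 0.7)]

values_on = [l1_mean_criterion(X, mu) for mu in on_segment]
values_off = [l1_mean_criterion(X, mu) for mu in off_segment]
```

All candidates on the segment give the same criterion value, and every candidate off the segment gives a strictly larger one, which is the non uniqueness along a line of slope $-1$ described in case (a).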
Abstract: Using a B-spline representation for splines, with knots seen as free
variables, the approximation of data by splines improves greatly. The damage
due to the presence of many local optima is considerable in the univariate
regression context, and things get worse in multivariate additive modeling. We
present a new algorithm that selects knots subject to constraints in order to
compute least squares spline approximations.
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1. Introduction 

The approximation of functions by splines has long been known to improve
dramatically if the knots are free parameters. A serious problem is the
existence of many stationary points of the least squares objective function,
and the apparent impossibility of deciding when the global optimum has been
found. Free knot splines have not been as popular as might be expected from
these results. One problem has been that analytic expressions for optimal knot
locations, or even for general characteristics of optimal knot distributions,
are not easy to derive. Computationally, things are a little better: there
exist several algorithms to find knot locations. Gallant and Fuller (1973) use
an iterative algorithm based on the Gauss-Newton method to solve the problem.
Jupp (1978) introduces a transformation of the knots to avoid the “lethargy”
phenomenon: this transformation sends the boundaries of the knot set to
infinity, which makes the coalescence of free knots impossible. If we can
choose and increase the number of samples, we can use the Agarwal and Studden
(1980) algorithm. Guertin (1992) penalizes the distance from equidistant knots.
In this paper, we recall the free knots problem and the consequences of the
“lethargy” theorem. A new algorithm (Molinari 1997) is proposed to compute knot
locations in simple and multiple regression by least squares splines. We regard
a knot as a convex combination of the data points. This leads us to force the
knots to lie within disjoint intervals. The efficiency and speed of the box
constrained minimization algorithm allow us to propose a method that explores
the local minima (knot locations) in an attractive amount of computational time
and that definitely avoids the coalescence of free knots.
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2. B-spline functions, fixed and free knots for least-squares 
splines 

2.1 B-spline functions 

Let $(\xi_0 =)\, a < \xi_1 < \xi_2 < \cdots < \xi_K < b\, (= \xi_{K+1})$ be a
subdivision of the interval $[a, b]$ by $K$ distinct points. These points are
the “knots” of the spline function.

A spline function, $s(t)$, of order $d + 1$ or degree $d$ is a function which
(i) is a polynomial of degree $d$ in each open interval $]\xi_{i-1}, \xi_i[$
for $i = 1, \ldots, K + 1$,
(ii) has $d - 1$ continuous derivatives in the open interval $]a, b[$.

For a fixed set of knots, $\xi = (\xi_1, \ldots, \xi_K)'$, the set of such
splines is a linear space of functions with $K + d + 1$ free parameters (de
Boor 1978).

A basis for this linear space is given by Schoenberg’s B-splines, or
Basic-splines. This basis provides a stable method (de Boor 1978) for computing
spline functions written as

$$s(t, \beta, \xi) = \sum_{l=1}^{K+d+1} \beta_l B_l(t, \xi),$$

where $\beta = (\beta_1, \ldots, \beta_{K+d+1})'$ is the vector of the spline
coefficients.

2.2 Simple regression and lethargy problem 

Let $\{(x_i, y_i)\}_{i=1,\ldots,n}$ be a set of $n$ observations ranging over
$[a, b] \times \mathbb{R}$. Denote by $B(\xi) = [B_l(x_i, \xi)]_{i,l}$ the
$n \times (K + d + 1)$ matrix of sampled basis functions, and by
$y = (y_1, \ldots, y_n)'$ the vector of sampled responses.

When knots are fixed, it is a straightforward linear least squares problem to
fit the data by splines (Stone 1985)

$$\hat\beta(\xi) = \arg\min_{\beta} \|y - B(\xi)\beta\|^2 = (B(\xi)' B(\xi))^{+} B(\xi)' y, \qquad (1)$$

where $\|.\|$ is the usual Euclidean norm and $A^{+}$ denotes the Moore-Penrose
inverse of a matrix $A$. Then, the regression function is estimated by
$\hat s(x, \xi) = \sum_{l} \hat\beta_l(\xi) B_l(x, \xi)$.

When the locations of knots are free variables, the class of splines is no
longer linear but forms a mixture of linear and nonlinear parameters. Then, the
least squares problem is to find

$$\arg\min_{\beta, \xi} \|y - B(\xi)\beta\|^2. \qquad (2)$$

When $\xi$ is fixed, the problem (2) reduces to (1). In fact, Golub and Pereyra
(1973) show that the solutions to (2) are those to

$$\arg\min_{\xi} \|y - B(\xi)\hat\beta(\xi)\|^2. \qquad (3)$$

As a consequence, we from now on denote the objective function
$F(\xi) = \|y - B(\xi)\hat\beta(\xi)\|^2$.
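The reduction from (2) to (3) means that, for any trial knot vector, the coefficients are obtained by ordinary least squares and only the residual sum of squares is kept as a function of the knots. The sketch below illustrates this under two loudly stated simplifications: a degree 1 truncated power basis is used instead of B-splines (it spans the same spline space but is less stable numerically), and the normal equations are solved by a small hand-written Gaussian elimination. All names and the data are made up for the example.

```python
def truncated_linear_basis(x, knots):
    """Row of the degree 1 truncated power basis: 1, x, (x - k)_+ per knot."""
    return [1.0, x] + [max(0.0, x - k) for k in knots]


def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def objective(knots, xs, ys):
    """F(knots): residual sum of squares after solving for the coefficients."""
    rows = [truncated_linear_basis(x, knots) for x in xs]
    q = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(q)] for i in range(q)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(q)]
    beta = solve_linear_system(gram, rhs)
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(rows, ys))


# Noise-free piecewise linear signal with a single kink at 0.5 (made-up data).
xs = [i / 20 for i in range(21)]
ys = [abs(x - 0.5) for x in xs]
f_true = objective([0.5], xs, ys)   # knot placed on the kink
f_off = objective([0.3], xs, ys)    # knot placed elsewhere
```

With the knot on the kink the fit is exact up to rounding, whereas a misplaced knot leaves a clearly positive residual sum of squares; this dependence of $F$ on the knot location is what the free knot problem optimizes.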
The following “lethargy” theorem (Jupp 1975) is intrinsic to the free knots
problem, and affects the stable and effective computation of optimal knots.

The lethargy theorem: Denote by
$S_K[a, b] = \{\xi \in \mathbb{R}^K ;\ a < \xi_1 < \xi_2 < \cdots < \xi_K < b\}$
the open simplex of knots, and consider the $p$-th (open) main face of
$S_K[a, b]$, defined by the system $\{(\xi_j - \xi_{j-1}) > 0$ for $j \neq p$,
and $\xi_p = \xi_{p-1}\}$. On the $p$-th main face,
$n_p' \nabla F(\xi) = 0$ for $p = 2, \ldots, K$, where $n_p$ is the unit outward
normal to this face and $\nabla F(\xi)$ the gradient of $F(\xi)$.

The first consequence is the existence of many stationary points of $F(\xi)$ on
the faces, and the second is the poor convergence, or “lethargy” properties, of
algorithms that attempt to solve the free knots problem when they are near the
boundaries. Figure 1 shows $S_2[a, b]$, which is a triangle.

Figure 1. The simplex $S_2[a, b]$ as the parameter space for splines with two
variable knots.

The “lethargy” property is illustrated on the titanium heat data (de Boor and
Rice 1968), a set of 49 data values that express a thermal property of
titanium. As the authors report, this data set is hard to approximate and has
significant experimental error. Figure 2 shows the approximation by a cubic
spline corresponding to a local minimum with two coalescent knots.

Figure 2. Titanium heat data approximated by a cubic spline with 5 non optimal
knots.

2.3 Additive multiple regression 

Let us consider $p$ centered covariates $(X_1, \ldots, X_p)$, and a random
centered response $Y$, all measured on $n$ observations gathered in the
matrices $X$ and $Y$. The problem we consider initially is the estimation of
the conditional expectation or regression function
$f(x_1, \ldots, x_p) = E(Y \mid X_1 = x_1, \ldots, X_p = x_p)$.

For constructing an additive estimator $\hat f$ of $f$ defined by
$\hat f(x_1, \ldots, x_p) = \sum_{j=1}^{p} \hat f_j(x_j)$, where the
$\hat f_j$'s are centered, we use the least squares spline (LSS) approximation
suggested by Stone (1985).

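When knots are fixed for any predictor $j$, i.e. given a fixed spline basis
$\{B^j_l(\cdot, \xi^j)\}_{l = 1, \ldots, K_j + d_j + 1}$, the (LSS) estimator of
$f$ is defined by

$$\hat f(x_1, \ldots, x_p) = \sum_{j=1}^{p} \sum_{l=1}^{K_j + d_j + 1} \hat\beta^j_l(\xi) B^j_l(x_j, \xi^j),$$

where $\hat\beta(\xi)$, the super vector of the spline coefficients
$\hat\beta^j_l(\xi)$, is solution to the least squares problem

$$\hat\beta(\xi) = \arg\min_{\beta} \|Y - B(\xi)\beta\|^2 = (B'(\xi) B(\xi))^{+} B'(\xi) Y. \qquad (4)$$

The super vector of knots is denoted
$\xi = (\xi^1_1, \ldots, \xi^1_{K_1}, \ldots, \xi^p_1, \ldots, \xi^p_{K_p})'$
and $B(\xi) = [B^1(\xi^1) | \cdots | B^p(\xi^p)]$ is the
$n \times \sum_j (K_j + d_j + 1)$ column centered super coding matrix.

Stone (1985) showed that

$$\hat f_j(x_j) = \sum_{l=1}^{K_j + d_j + 1} \hat\beta^j_l(\xi) B^j_l(x_j, \xi^j)$$

are consistent estimators of the coordinate functions $f_j(x_j)$.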

As in the univariate case, when knot locations are free variables, Golub and
Pereyra’s (1973) result also holds, and we only have to minimize the objective
function $F(\xi) = \|Y - B(\xi)\hat\beta(\xi)\|^2$ with respect to $\xi$.

3. Bounded Optimal knots 

3.1 Description in the univariate case 

Because the knots belong to the interval $[a, b]$, we impose that each knot
$\xi_i$ lies within some window $[l_i, u_i]$, and, to avoid the lethargy
problem, that the windows are disjoint.

Let $l = (l_1, \ldots, l_K)'$, $u = (u_1, \ldots, u_K)'$ be the vectors of lower
and upper bounds. To avoid overlapping windows, we take
$l_{i+1} - u_i = \varepsilon > 0$ for $i = 1, \ldots, K - 1$, and
$l_1 - a = b - u_K = \varepsilon$. Note that
$\lim_{\varepsilon \to 0} \bigcup_i [l_i, u_i] = [a, b]$.

When the windows are fixed, the bounded optimal knot problem is to find

$$\hat\xi(l, u) = \arg\min_{l \le \xi \le u} F(\xi). \qquad (5)$$

Figure 3. The $S_2[a, b]$ simplex and a set of bound constrained knots.




Clearly $[l_i, u_i] \subset [a, b]$ for $i = 1, \ldots, K$, and (5) doesn’t
necessarily provide the global minimum to (3) because

$$\min_{l \le \xi \le u} F(\xi) \ge \min_{\xi \in ([a, b])^K} F(\xi).$$

However, problem (5) is very easy to solve by using classical fast algorithms.
Visual examination of the spline approximation allows the user to experiment
with different choices of windows.

To explore most of the local optima, the strategy presented below provides an
automatic selection of the windows. We construct a sequence of, say, $N$
solutions to (5) based on $N$ independent uniformly distributed partitions
$(l^{(t)}_1, \ldots, l^{(t)}_K)_{t = 1, \ldots, N}$. In fact, because upper and
lower bounds are explicitly linked ($u_i = l_{i+1} - \varepsilon$) we only need
to generate a sequence $(l^{(t)}_2, \ldots, l^{(t)}_K)_{t = 1, \ldots, N}$ of
$K - 1$ uniformly drawn lower bounds. For sufficiently large $N$, we expect
that

$$\hat\xi = \arg\min_{\hat\xi(l^{(t)}, u^{(t)}),\ t = 1, \ldots, N} F\bigl(\hat\xi(l^{(t)}, u^{(t)})\bigr)$$

provides a good approximation to the optimal knot locations. Clearly, the
number $N$ of experiments is unknown, and we heuristically use
$N = 100 \times K$.
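The window construction can be sketched as follows. This is only an illustration of the constraints stated above ($l_1 - a = b - u_K = \varepsilon$ and $l_{i+1} - u_i = \varepsilon$), not the authors' implementation: the function name, the rejection of draws that would produce an empty window and the cap on the number of tries are assumptions of the example. Each accepted set of windows would then be passed, with the starting knots, to a box constrained minimizer of $F$.

```python
import random


def draw_windows(a, b, n_knots, eps, rng, max_tries=10000):
    """Draw one random set of disjoint windows [l_i, u_i] inside [a, b]."""
    for _ in range(max_tries):
        # K - 1 uniformly drawn lower bounds; the first one is tied to a.
        lowers = [a + eps] + sorted(rng.uniform(a, b) for _ in range(n_knots - 1))
        # Upper bounds follow from u_i = l_{i+1} - eps and u_K = b - eps.
        uppers = [lowers[i + 1] - eps for i in range(n_knots - 1)] + [b - eps]
        # Assumption of this sketch: reject draws giving an empty window.
        if all(lo < up for lo, up in zip(lowers, uppers)):
            return list(zip(lowers, uppers))
    raise RuntimeError("no admissible set of windows found; eps may be too large")


rng = random.Random(0)
windows = draw_windows(0.0, 1.0, n_knots=5, eps=0.01, rng=rng)
# Starting knots at the center of each window.
start_knots = [(lo + up) / 2 for lo, up in windows]
```

Because consecutive windows are separated by a gap of width $\varepsilon$, two knots constrained to different windows can never coalesce, whatever the minimizer does inside the boxes.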

3.2 The multivariate additive case 

Using the notations of 2.3, the natural extension of the preceding procedure
leads to the following optimisation problem, which is analogous to (5):

$$\hat\xi(l, u) = \arg\min_{l^j \le \xi^j \le u^j,\ j = 1, \ldots, p} F(\xi), \qquad (6)$$

where the super vector of knots $\xi = (\xi^1, \ldots, \xi^p)$ is constrained
to lie within the super vectors of lower and upper bounds.
4. Applications

The examples and simulations were run on an Ultra-Sparc station, through the
S-Plus function bok (for bounded optimal knots), which applies the native
function nlminb successively to N sets of windows. By default, initial knots
are located at the center of the windows. The user has to decide on 3 types of
input parameters: the number, N, of experiments, the number, K, of knots, and
the degree, d, of the spline polynomials:

bok(x, y, N, K, d).

4.1 The titanium heat data 

Using $(N, K, d) = (500, 5, 3)$, the global optimum of Jupp (1978) is obtained
at the knot location $\hat\xi = (37.62, 43.97, 47.37, 50.20, 59.21)$ after 8400
seconds of cpu time. Of course, the number of simulations (500) is an upper
bound which should be improved in the future. In any case, placing the initial
knots at the center of the windows did not affect the convergence of the
procedure in any of the examples we tried.

Figure 4. Titanium heat data approximated by a cubic spline with 5 knots. The
location of the optimal knots is indicated by the vertical lines.




4.2 Simulations for multivariate additive modeling 

To illustrate the additive case, the data (Durand 1997) consist of a sample
($n = 30$) of $p = 5$ predictors which are strongly correlated and generated as
follows: $X_1$ is uniform on $[-1, 1]$, $X_2 = 0.9 X_1 + \varepsilon_1$,
$X_3 = -1.1 X_2 + \varepsilon_2$, $X_4 = 0.9 X_3 + \varepsilon_3$ and
$X_5 = -1.1 X_4 + \varepsilon_4$, where $\varepsilon_1$ and $\varepsilon_2$ are
normal (0, 0.2) while $\varepsilon_3$ and $\varepsilon_4$ are normal (0, 0.1).
The response is generated by $Y_i = f(X_i) + \varepsilon_i$ for
$i = 1, \ldots, 30$, with the $\varepsilon_i$ independently drawn from a normal
(0, 0.1) distribution. The function $f$ is taken to be

/(xi, • • • , Xs) = 2 sin( 7 TXi) — 6^2 + 3 x 3 — 2x2 + X5. 

The B-splines used for the first 2 predictors are of degree 2 with 3 knots, and
those used for the last 3 variables are of degree 1 with 3 knots.

The author shows that when few samples are used for strongly correlated
covariates, least squares spline models obtained with fixed knots, present a
large

We use the program mbok (for multivariate bounded optimal knots), which is the
analogue of bok for the additive case:

mbok(X, y, N, K, d),

where K is the vector of the numbers of knots and d the vector of the spline
degrees.

Using $(N, K, d) = (100, (3, 3, 3, 3, 3), (2, 2, 1, 1, 1))$, figure 6 shows that
well located knots improve the prediction, and that the bounded optimal knots
method can be applied successfully to additive data.

5. Conclusion 

The presented method, applied in both univariate and multivariate contexts, is
based on the use of bounded optimal knots to avoid coalescent knots. It
yields a non-deterministic algorithm which is not sensitive to the crucial problem
of locating the initial knots.

Finally, a cross-validation procedure can be used to estimate the remaining tuning 
parameters: the number of knots and the degree of spline functions. 

Acknowledgments. The authors are grateful to the referee for helpful
comments and suggestions.

Abstract: For the interpretation of a two-way contingency table
cross-classification, based on two hierarchies, new indices are developed, able to
evaluate the relative contribution of either hierarchy's nodes to the other's nodes or
partitions. The indices are based on the projection of the groups' inertia onto the orthogonal basis
associated with the explanatory hierarchy, in a vector space with $\chi^2$ metrics. An
application based on 1991 Census data for the Lombardy municipalities is given.

Keywords: Contingency tables. Hierarchical classification. Cross-classification. 



1. Introduction 

The study of cross-classification of two-way contingency tables aims at finding the
best one and at explaining either of the two one-way partitions through the other.
Orloci (1978) and Feoli and Orloci (1979) propose the analysis of concentration
for the evaluation of the sharpness of a vegetation table; it is a kind of correspondence
analysis based on a normalised table, but its use is limited to presence-absence
data. Camiz (1993, 1994) uses it for the search of the best cross-classification in a
procedure for structuring vegetation tables, but even in this application the
association between groups is detected only through the usual correspondence
analysis practice, namely the inspection of scatter diagrams. Govaert (1984)
developed techniques based on Diday's (1970) dynamical clouds, whose results
depend on the a priori choice of a suitable number of groups for each partition.
Lebart et al. (1979) propose statistical methods to identify each group's typical
variables. For frequency variables, those considered typical are the ones whose occurrence
in the group is significantly higher or lower than their expected value, on the basis
of the hypergeometric law. This does not allow a direct comparison of two
classifications, since the typical variables may not be arranged according to a hierarchy: in fact, they
may be the same for two different groups of the same partition, but at different
frequency levels. Nevertheless, it is a useful method for gaining deep insight into a
classification structure. Benzécri et al. (1980) propose a decomposition of node
inertia on canonical bases, with the same drawbacks as Lebart et al. (1979).

The proposed technique, which may be used as a stand-alone cross-classification
procedure, as developed in Denimal (1997), considers two vector bases associated
with ascending hierarchical classifications, performed on both rows and columns of
the two-way contingency table. The associated indices are used to evaluate the
relative contribution of the nodes of one hierarchy to both the nodes and the
partitions of the other hierarchy. In this way a relation may be found between hierarchies and partitions,
helping in both interpretation and graphical representation of results.



2. The method 

Given a two-way contingency table K, for two sets of characters I and J, we
consider two hierarchies $H^I$, $H^J$ built on I and J, respectively, using Ward's (1963;
see also Greenacre, 1988) criterion. Two orthogonal bases are associated on vector
spaces $R^I$ and $R^J$ to these hierarchies, where each vector (but one) represents a
node in the corresponding hierarchy. Since these bases are orthogonal, they allow
us to decompose the squared distance between each group centroid and the grand
centroid. Thus, indices may be established, to interpret hierarchy and partitions on
either set in terms of the hierarchy on the other.

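Let $k_{ij}$, $i \in I$, $j \in J$, be a generic element of K. As usual, marginal and grand totals
may be written as $k_{i.} = \sum_{j \in J} k_{ij}$, $k_{.j} = \sum_{i \in I} k_{ij}$, $k_{..} = \sum_{(i,j) \in I \times J} k_{ij}$. Given two subsets
$p \subseteq I$, $q \subseteq J$, we denote by $k_{p.}$, $k_{.q}$ and $k_{pq}$ the partial sums $k_{p.} = \sum_{i \in p} k_{i.}$, $k_{.q} = \sum_{j \in q} k_{.j}$,
$k_{pq} = \sum_{(i,j) \in p \times q} k_{ij}$. If we set these in the correspondence analysis frame, where each
element $i \in I$ is represented by a point $(k_{ij}/k_{i.})_{j \in J} \in R^J$ with mass $k_{i.}/k_{..}$ (and
analogous formulae hold for j), the centroids of all subsets p and q have representations
given by $c_p^J = (k_{pj}/k_{p.})_{j \in J} \in R^J$ and $c_q^I = (k_{iq}/k_{.q})_{i \in I} \in R^I$ respectively, where $c^J = (k_{.j}/k_{..})_{j \in J}$
and $c^I = (k_{i.}/k_{..})_{i \in I}$ are the grand centroids, i.e. the marginals. If we provide the vector spaces
$R^I$ and $R^J$ with the $\chi^2$ metrics, a basis $(e^m)_{m \in H^J}$ associated to hierarchy $H^J$
for $R^J$ may be given by:

$$
\forall j \in J,\ \forall m = (q_1, q_2) \in H^J: \quad
e^m_j =
\begin{cases}
k_{.j}/k_{.q_1} & \text{if } j \in q_1 \\
-k_{.j}/k_{.q_2} & \text{if } j \in q_2 \\
0 & \text{otherwise}
\end{cases}
\qquad (1)
$$
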

with $q_1$ and $q_2$ the two subsets that merge at the m-th node. This basis is
orthogonal (Benzécri et coll., 1973-82; Weiss 1978; Cazes, 1984) and an analogous
one may be given for $R^I$.
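
The orthogonality claim can be checked numerically on a toy example. The sketch below is an illustration, not the authors' code: it assumes that a node merging two column subsets q1 and q2 is represented by the vector with entries k.j/k.q1 on q1, -k.j/k.q2 on q2 and 0 elsewhere, and that the metric is the chi-square metric (weights equal to the inverse column margins). The column totals and the three-node hierarchy are made up, and the function names are hypothetical.

```python
import numpy as np

def node_vector(col_totals, q1, q2):
    """Vector attached to the node that merges column subsets q1 and q2."""
    e = np.zeros(len(col_totals))
    e[q1] = col_totals[q1] / col_totals[q1].sum()
    e[q2] = -col_totals[q2] / col_totals[q2].sum()
    return e

def chi2_inner(u, v, col_totals):
    """Inner product in the chi-square metric (weights 1 / column margin)."""
    margins = col_totals / col_totals.sum()
    return float(np.sum(u * v / margins))

# Made-up column totals and a small hierarchy: {0}+{1}, then {0,1}+{2}, then {0,1,2}+{3}.
col_totals = np.array([10.0, 20.0, 30.0, 40.0])
e_low = node_vector(col_totals, [0], [1])
e_mid = node_vector(col_totals, [0, 1], [2])
e_top = node_vector(col_totals, [0, 1, 2], [3])
grand = col_totals / col_totals.sum()   # grand centroid (the extra basis vector)

for name, (u, v) in {
    "low/mid": (e_low, e_mid),
    "low/top": (e_low, e_top),
    "mid/top": (e_mid, e_top),
    "top/grand": (e_top, grand),
}.items():
    print(name, round(chi2_inner(u, v, col_totals), 12))
```

All printed inner products vanish up to rounding, which is what allows the squared distance of a group centroid from the grand centroid to be split into one term per node.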

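To interpret hierarchy $H^I$ in terms of $H^J$, the squared distance $\|c_p^J - c^J\|^2$ of the
branch or group p from the grand centroid $c^J$ is partitioned according to the basis elements
associated with the hierarchy $H^J$, and we have

$$
\|c_p^J - c^J\|^2 = \sum_{m = (q_1, q_2) \in H^J} d^2_{p,(q_1,q_2)} \qquad (2)
$$

where

$$
d^2_{p,(q_1,q_2)} = \frac{k_{.q_1} k_{.q_2}}{k_{..}\,(k_{.q_1} + k_{.q_2})}
\left( \frac{k_{pq_1} k_{..}}{k_{p.} k_{.q_1}} - \frac{k_{pq_2} k_{..}}{k_{p.} k_{.q_2}} \right)^2 \qquad (3)
$$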



As a consequence, for each group p of the $H^I$ hierarchy, indices $I_{p,(q_1,q_2)}$
may be calculated for the most significant nodes $m = (q_1, q_2) \in H^J$. In particular,
indices $I$, summing to 1, measure the relative contribution of the hierarchy nodes in
the explanation of the group's distance from the centroid. In addition, if a node is
composed of branches $q_1$ and $q_2$, both quantities $f_{p,q_1} = \frac{k_{pq_1} k_{..}}{k_{p.} k_{.q_1}}$ and $f_{p,q_2} = \frac{k_{pq_2} k_{..}}{k_{p.} k_{.q_2}}$,
or their logarithms $lf_{p,q_1} = \log f_{p,q_1}$ and $lf_{p,q_2} = \log f_{p,q_2}$, may be used to check their
association to group p of the $H^I$ hierarchy. If $f_{p,q_1} > 1$ (i.e. $lf_{p,q_1} > 0$), the association
between groups p and $q_1$ is positive, and it is negative if $f_{p,q_1} < 1$ (i.e. $lf_{p,q_1} < 0$). In
this way one may outline the additive contribution of the groups forming the
hierarchy nodes to the other hierarchy's nodes or partition groups. Analogous
indices may be calculated for groups q of the $H^J$ hierarchy.
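
As an illustration of the association check, the sketch below computes, for a row group p and a column group q, the ratio of the observed block total to the total expected under independence, together with its logarithm. It assumes this ratio is what the quantities f and lf denote; the 2 x 2 table and the function name `association_index` are made up for the example.

```python
import numpy as np

def association_index(K, p, q):
    """Observed / expected total of block (p, q) of contingency table K.

    Values above 1 (positive log) indicate positive association between
    row group p and column group q; values below 1 indicate negative association.
    """
    K = np.asarray(K, dtype=float)
    k_pq = K[np.ix_(p, q)].sum()   # block total
    k_p = K[p, :].sum()            # row-group margin
    k_q = K[:, q].sum()            # column-group margin
    return k_pq * K.sum() / (k_p * k_q)

K = np.array([[20, 5],
              [5, 20]])
f_same = association_index(K, [0], [0])    # 20 * 50 / (25 * 25) = 1.6
f_cross = association_index(K, [0], [1])   # 5 * 50 / (25 * 25) = 0.4
print(f_same, np.log(f_same))
print(f_cross, np.log(f_cross))
```

Working with the logarithm places positive and negative associations symmetrically around zero, which is convenient when the indices are displayed as shaded table cells.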



3. Application to Lombardia 1991 Census data 

The data are the 575 indicators of the 1991 Italian population Census, taken for the
1546 municipalities of the Lombardia region, concerning population and dwellings. The
aim of the exploratory analysis (Camiz and Tulli, 1996) was to classify the
municipalities, based on the similarity of both population and dwellings characters.
Correspondence analysis (Benzécri et coll., 1973-82; Lebart et al., 1984) gave
three main axes summarizing 83% of total data table information (inertia). This
results in a very strong synthesis, that we summarize here: on the first axis, rural
lowland areas of the Mantova and Brescia regions are opposed to Milan and other main
cities; the second axis highlights all holiday towns and villages, with a high
density of holiday houses; the third opposes Milan and small villages to the other
main cities of Lombardia, i.e. it opposes the houses surface to all items tied to both
population and houses abundance. Based on the coordinates on these three axes, a
hierarchical ascending classification was performed, using Ward's (1963) method,
on both municipalities and indicators. In both cases, considering the variance within
groups, the partitions in 6 and 9 groups were the most interesting. It must be noted
that the total information given by these reduced tables was nearly 77% and 80%
of the original table, respectively: an outstanding result, since, in particular, the
latter corresponds nearly to the amount of inertia explained by the three chosen
axes.

The problem was now to choose the best cross-classification and to explain the
municipalities groups, based on the values of Census indicators: a task very hard to
perform, due to the very high number of items involved. For this reason, the
technique of Lebart et al. (1979) was very difficult to use, at least at a first
level; in addition, it was impossible to cross this characterisation with the indicators
clustering. Feoli and Orloci's (1987) concentration analysis was not applicable,
limited as it is to presence/absence data. In Table 1 indices $lf_{p,q}$ are reported,
showing the influence of the hierarchy built on Census indicators on the
nodes of the municipalities hierarchy and on the derived partition into 6 groups. Different
cell shading indicates the intensity of influence, the lighter background meaning
lowest frequencies, thus negative contribution, and the darker meaning highest
frequencies, thus positive contribution. Through the inspection of this table and the
reciprocal one, based on the municipalities hierarchy, we noticed that there was a very
strong influence of the items of each hierarchy on the other: this suggests a very good
agreement between them, nearly a block-diagonal structure.



Table 1: Indices $lf_{p,q}$ corresponding to the first five nodes of hierarchies and 6×6
cross-classification of Lombardia Census data. Rows represent municipalities
nodes (CA...CE) and groups (C1...C6) and columns indicators nodes (RA...RE)
and groups (R1...R5, Milan). Within parentheses is given each group frequency.

In particular, node CA highly influences the node RA. In fact, the branch CB,
composed of items defining country sites and tourist areas, has its highest frequency
in municipalities in branch RC, composed of small centers in country and mountain
sites, and its lowest in those in branch RB; on the contrary, indicators of branch CC,
related to urban population characters, have high frequencies in municipalities in
branch RB, namely Milan and other large cities. The influence of the second-level node
was also highly significant, since it splits both the country site municipalities and
items into purely rural (C2, R4) and holiday site characters (C7, RD). The further
nodes set Milan apart from the other cities (R5), due mainly to the items related
to densely populated areas, with large houses (C3), absent in Milan; on the
contrary, items tied to rich residential areas and tertiary production
have high frequencies in Milan. Indices corresponding to further nodes were useful to
identify lower-level node relationships, which were not as evident as the previous ones.
They tell apart rural areas, up to the 6 x 6 level, which was thus suggested as the best
cross-classification. This could mean that from that level on the interpretation of the
cross-classification was no longer possible, and that differences among further
partition groups of one hierarchy may not be understood only on the basis of the
nodes of the other one.



4. Conclusion 

Specifically suited for the interpretation of a cross-classification, this method
proves to be useful even when the interpretation of only one classification is
important, provided that it may be explained by the other hierarchy. In addition,
when cross-classification is an important issue, the method is helpful in detecting
the lowest levels where each partition may be explained through the other. In the
case of Census data, it was effective in detecting its most interesting features, which
were not easily detected through alternative techniques. The fact that the
interpretation was limited to the highest hierarchy levels means that deeper
partitions, in particular those of municipalities, do not correspond to a
cross-classification. Deeper partitions may be explained only by considering different mean
frequencies of the same items in the different groups, something that this technique
may not detect. Although very difficult to use as a first step, when many
variables are involved, the technique of Lebart et al. (1979) may be more suited for this
purpose as a second step, involving quantitative aspects of only a selected number
of items of interest.

Some development directions are now under investigation: the distribution and
significance of the indices; the graphical representation of results, including an improved cell
shading and the representation of block structure, as Benzécri et al. (1973-82:
I, 82 and 365) suggest; the use of the procedure in order to improve the structuring of vegetation
data tables, very close to Bertin's (1977) graphical representation of data;
and the generalisation to other cases, such as three-way contingency data tables,
individuals x characters, and individuals x variables tables.

Abstract: The aim of this paper is to compare two different
methods for the restoration of degraded images, whose generating model is
assumed to be a Gauss-Markov Random Field (GMRF): the former is a Bayesian
method, the so-called Iterated Conditional Modes (ICM; Besag, 1986), and the
latter is based on a Kalman-type filtering and smoothing algorithm (Moura,
Balram, 1993). After briefly presenting the fundamental concepts of each of these
two approaches, we give the results of an application to a benchmark image,
together with some comments and conclusions.

Keywords: Gauss-Markov Random Fields, Bayesian Image Analysis, Kalman 
Filter. 



1. Gauss-Markov Random Fields in image analysis 

The basic problem we are engaged with is the restoration of an image x, given the
observation of a degraded image $y = y(H(x), \varepsilon)$, where H is a blurring matrix and $\varepsilon$
is a noise independent of x. In order to establish the background for statistical
image modelling, it will be assumed that any image of interest can be represented
on an $N \times N$ (= n) square regular lattice S whose pixel co-ordinates are indicated
through the pair of integers (h, j). A conventional raster-type
scan will be assumed as follows: left to right scan, advance one line, repeat, so
that we indicate each pixel variable through a single index, i.e. $x_i$ will denote the
grey level assumed by the variable at pixel i (= 1, 2, ..., n). In order to assign a
probability measure over the space of all the possible configurations
$x \in \Omega$ (= $\{0, 1, \ldots, T-1\}^n$) we define a Markov Random Field (Mrf) as follows (Besag,
1974):

$$f(x) = \exp\{-U(x)\} / Z \qquad (1)$$

where

$$U(x) = \sum_{1 \le i \le n} x_i G_i(x_i) + \sum\sum_{1 \le i < j \le n} x_i x_j G_{ij}(x_i, x_j) + \ldots$$

is the energy function of statistical mechanics, the G(.) functions represent the
potentials of the field - that is, the local interaction parameters - and Z is the
normalising constant which makes the function integrate to one. Equation
(1) is of a general form and includes higher order interactions, but we will deal
only with pairwise interaction Mrfs, that is those fields for which the only cliques
are those consisting of one or two pixels, which can be considered neighbours on
the basis of some distance criterion. In fact, the key to the definition of a Mrf is
the specification of the neighbourhood system $\delta_i$ on which the Markov property
holds:

$$P(x_i \mid x_j,\ j \ne i) = P(x_i \mid x_j,\ j \in \delta_i) \quad \forall i \in S$$

Then, the Hammersley-Clifford theorem states that the joint density (probability)
function of x has the general form (1), and that the G(.) functions are non-zero if
and only if their arguments are defined over a clique.

In the remainder of the work we will consider the true scene as a realization of a
zero-mean Gauss-Markov Random Field (GMRF), that is, the distribution of $x_i$,
conditional upon its neighbours, is Normal, with parameters:

$$E(x_i \mid x_j,\ j \in \delta_i) = \sum_{j \ne i} \beta_{ij} x_j ; \qquad Var(x_i \mid x_j,\ j \in \delta_i) = \tau^2$$

It follows (Besag, 1974) that the joint distribution of x is

$$f(x) \propto \exp\{-\tfrac{1}{2\tau^2}\, x'Ax\} \qquad (2)$$

where A is the (n×n) potential matrix, with entries equal to 1 along the principal
diagonal, to $-\beta_{ij}$ if the zones i and j are neighbours (according to some fixed
distance criterion), and zero elsewhere.
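
To make the structure of the potential matrix concrete, the following sketch builds A for a first-order field on a small N x N lattice scanned in raster order. It is an illustration under assumptions not stated in the text: free (non-periodic) boundaries, one common horizontal interaction `beta_h` and one common vertical interaction `beta_v`; the function name `potential_matrix` is hypothetical.

```python
import numpy as np

def potential_matrix(N, beta_h, beta_v):
    """Potential matrix of a first-order GMRF on an N x N lattice.

    Pixels are indexed in raster order (i = row * N + col). The diagonal is 1,
    horizontal neighbours get -beta_h, vertical neighbours get -beta_v.
    Assumption: free boundaries (no wrap-around).
    """
    n = N * N
    A = np.eye(n)
    for row in range(N):
        for col in range(N):
            i = row * N + col
            if col + 1 < N:                      # right-hand neighbour
                A[i, i + 1] = A[i + 1, i] = -beta_h
            if row + 1 < N:                      # neighbour on the next line
                A[i, i + N] = A[i + N, i] = -beta_v
    return A

A = potential_matrix(4, beta_h=0.34, beta_v=0.15)
print(A[:5, :5])
print("smallest eigenvalue:", np.linalg.eigvalsh(A).min())
```

With |beta_h| + |beta_v| < 0.5 every row is strictly diagonally dominant, so the matrix is positive definite and the Gaussian density built on it is proper.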



2. Bayesian approach to image restoration. 

The Bayesian ideas concerning image analysis may be briefly summarised as
follows: given a model for the true scene, $f(x)$, and a suitable choice for the noise
term, $f(y \mid x)$, we get, by applying Bayes' theorem:

$$f(x \mid y) = \frac{f(y \mid x)\, f(x)}{f(y)} \qquad (3)$$

so that the best reconstruction, $\hat{x}$, is the one which maximises $f(x \mid y) \propto f(y \mid x) f(x)$.
This is the principle of the so-called Maximum a Posteriori procedures, but it is
clear that the search for the maximum of the very complex posterior functions
coming from (3) implies great computational problems. Instead, we focus our
attention on a particular restoration method known as Iterated Conditional Modes
(ICM). This is a deterministic, fast algorithm which derives from the following
expression:







$$f(x_i \mid y, \hat{x}_{S \setminus i}) \propto f(y_i \mid x_i)\, f(x_i \mid \hat{x}_{\delta_i}) \qquad (4)$$

For the Gaussian case, we can see in (4) that $f(y_i \mid x_i)$, that is the observation density
conditional on the model realisation, is normal (zero-mean noise with variance $\sigma^2$),
and that $f(x_i \mid \hat{x}_{\delta_i})$ is the model (Gaussian) conditional density. In order to
maximise the posterior distribution, it is very easy to demonstrate that the best
estimate of the true scene is obtained by minimising:

$$\sigma^{-2} (y - x)'(y - x) + \tau^{-2}\, x'Ax$$

so that the updating formula at pixel i, during a single ICM cycle, is:

$$\hat{x}_i = \frac{\tau^2 y_i + \sigma^2 \sum_{j \in \delta_i} \beta_{ij} \hat{x}_j}{\tau^2 + \sigma^2}$$

The algorithm is very easy to implement, and is based on an iterative image
restoration together with parametric estimation; it has been demonstrated that
the posterior $f(\hat{x} \mid y)$ converges rapidly to a (local) maximum.
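
A single ICM cycle for the Gaussian case can be sketched as a coordinate-wise minimisation of the quadratic criterion. The code below is a minimal illustration, not the authors' implementation: it assumes a first-order neighbourhood with free boundaries, known parameters (no re-estimation between cycles), no blurring, and an image already centred to zero mean. The names `icm_sweep` and `objective` are hypothetical, and the update used is the exact minimiser of the criterion in one pixel, namely a weighted average of the observed value and the neighbour-based prediction.

```python
import numpy as np

def icm_sweep(x, y, beta_h, beta_v, tau2, sigma2):
    """One raster-scan ICM cycle; each pixel is set to its conditional mode."""
    n_rows, n_cols = x.shape
    x = x.copy()
    for r in range(n_rows):
        for c in range(n_cols):
            pred = 0.0                       # neighbour-based prediction
            if c > 0:
                pred += beta_h * x[r, c - 1]
            if c + 1 < n_cols:
                pred += beta_h * x[r, c + 1]
            if r > 0:
                pred += beta_v * x[r - 1, c]
            if r + 1 < n_rows:
                pred += beta_v * x[r + 1, c]
            x[r, c] = (tau2 * y[r, c] + sigma2 * pred) / (tau2 + sigma2)
    return x

def objective(x, y, beta_h, beta_v, tau2, sigma2):
    """(y-x)'(y-x)/sigma2 + x'Ax/tau2 for the first-order potential matrix A."""
    quad = (np.sum(x ** 2)
            - 2.0 * beta_h * np.sum(x[:, :-1] * x[:, 1:])
            - 2.0 * beta_v * np.sum(x[:-1, :] * x[1:, :]))
    return np.sum((y - x) ** 2) / sigma2 + quad / tau2

rng = np.random.default_rng(1)
y = rng.normal(0.0, 50.0, size=(8, 8))       # made-up centred noisy image
params = dict(beta_h=0.34, beta_v=0.15, tau2=885.42, sigma2=2500.0)
x0 = y.copy()                                # start from the observed image
x1 = icm_sweep(x0, y, **params)
print(objective(x0, y, **params), objective(x1, y, **params))
```

Because each pixel update minimises the criterion in that coordinate with the others held fixed, the criterion cannot increase during a sweep; this is the sense in which ICM climbs to a local maximum of the posterior.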



3. A "classic" approach

In this case, in order to apply recursive algorithms such as the Kalman filter, it is
necessary to put the model, i.e. a GMRF, in a state-space form. The starting point
is the equivalence between our conditional model (2) and the simultaneous
representation given by Woods (1972): $Ax = \varepsilon$, with $\varepsilon \sim MVN(0, \tau^2 A)$ and A the
potential matrix. Then, by using the Cholesky factorization and the particularly
regular structure of the matrix A, it is possible (Moura, Balram, 1993) to
implement a recursive Kalman-type filtering and smoothing algorithm (Rauch,
Tung & Striebel, 1965) for the reconstruction of a degraded image y modelled as
follows: $y = x + w$, i.e. a GMRF plus an additive zero-mean independent Gaussian
noise with $E[ww'] = \sigma^2 I$.

The principal problem to overcome is that of the estimation of the parameter
vector $\theta = [\beta', \tau^2, \sigma^2]'$, where $\beta$ is the vector collecting the potentials of the field. As
regards $\sigma^2$, it has to be said that it is impossible to obtain its estimate with
standard statistical methods, so that it will be necessary to "guess" it through a
preprocessing of a uniform portion of the image. The maximum likelihood (ML)
estimation of the parameters of the field implies the minimisation of the following
negative log-likelihood function:

$$L(x; \beta, \tau^2) = 0.5 \ln(\tau^2) - (0.5 / N^2) \ln|A_p(\beta)| + [0.5 / (\tau^2 N^2)]\, x'A_p(\beta)x \qquad (5)$$

The greatest difficulty in minimising (5) is given by the presence of the
determinant of the potential matrix, which is a very complex function of the
parameters. In any case, it has been demonstrated (Balram & Moura, 1992) that,
for first-order fields and a particular class of second-order fields (those with equal diagonal
interaction parameters), it is possible to derive an explicit expression for the
eigenvalues of $A_p(\beta)$, and in this way the likelihood function (5) may be obtained
as a simple function of the parameters.

4. Experimental results and conclusions 

We have applied the two methodologies (bidimensional Kalman filter and ICM)
to a part of a demo image (a mandrill face) contained in our programming tool,
MATLAB v.5; the lattice is of dimensions 100x100 pixels, and the grey levels
vary in the range [0, 220]. We added zero-mean white Gaussian noise with
variance equal to 2500, producing a signal-to-noise ratio (SNR) of about 1.8.
Figs. 1(a) and 1(b) show, respectively, the original and the noisy image.

Figure 1: (a) original image (mandrill's eye); (b) noisy image ($\sigma^2 = 2500$)





The following Figs. 2(a) and 2(b) are the visual results obtained, respectively, by
means of the application of the filtering and smoothing phases of the Kalman-type
algorithm for a first-order homogeneous GMRF. The ML parameter estimates
are: for $\beta_h$ (horizontal interaction parameter), 0.34; for $\beta_v$ (vertical interaction
parameter), 0.15; and for $\tau^2$ (conditional variance), 885.42.

Figure 2: application of bidimensional Kalman algorithm; (a) restoration after 
filtering phase, and (b) after smoothing phase 





The goodness of reconstruction has been tested by calculating the Mean
Squared Error (MSE) between Fig. 1(a) and each of the other images; the values are,
respectively, 2503.6 (Fig. 1(b)), 2065.4 (Fig. 2(a)), and 1676.1
(Fig. 2(b)). As can be seen, the improvement between the filtering and smoothing
phases is of the order of 15.55% of the original MSE.

The Bayesian approach followed by ICM has led, in successive iterations, to the
images presented in Figs. 3(a)-(d); the parameters have been estimated by means of
the maximum pseudo-likelihood (MPL) method, which is based upon the
maximisation of the product of the conditional density functions of the current
reconstructed image:

$$PL(\theta) = \prod_{i=1}^{n} \Pr(x_i \mid x_j : j \ne i;\ \theta)$$



Figure 3: application of ICM algorithm; from (a) to (d): iterations from 1 to 4 




Numerical results, including parametric estimation and MSE between the original 
image and the reconstructions, are presented in Table 1 . 

Table 1 : results of the application of ICM to Fig. 1(b). 
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836.13 
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0.56 


0.02 


45.91 


2032.00 


857.09 



As can be noted, we did not present the resulting image for the 5th iteration,
because of the increased MSE value.

The results of our application would seem to indicate a clearly better performance
of the ICM method, when compared to that of the Kalman filtering and smoothing
algorithm. Notwithstanding a very good image restoration, the Bayesian approach
we have considered presents at least two problems worth remarking: a) the first one is
related to pseudo-likelihood estimation which furnishes, as can be seen in Tab. 1,
parameter values always outside of the admissible region [we recall that a
sufficient condition, in the case of first-order GMRFs, is $|\beta_h| + |\beta_v| < 0.5$]; moreover,
it can be seen that MPL is not a well-suited method for parametric estimation in the
presence of corrupted data: we have to point out that there is a need for some sort
of correction to avoid the great bias in the estimated parameters; b) it is not clear
when the iteration has to be stopped; we have adopted, for this purpose, the
statistical MSE measure, but it seems obvious that the algorithm does not
converge towards a fixed reconstruction.

Abstract: Projection pursuit regression is generally greatly influenced by
outlying points. Aiming at robustness of this non-parametric regression method,
we propose to resort to regression M-estimates. Robust estimates of the
projection pursuit regression model parameters are obtained, as in the robust
linear regression context, by an iterative reweighting procedure, with
observation weights determined by the bisquare weight function.

Key words: Projection pursuit regression, robust regression, M-estimators, 
bisquare weight function. 



1. Projection pursuit regression 

Projection pursuit regression (PPR), introduced by Friedman and Stuetzle
(1981) and refined by Friedman (1984), is a non-parametric regression method
for modelling a random vector Y, the response variable, as a function of a
p-dimensional random vector X of predictors, on the basis of a sample of n
observations $(y_i, x_i)$, $i = 1, \ldots, n$, of (Y, X). The response
variable is modelled as a linear combination of smooth functions of linear
combinations of the predictor variables. The model takes the form:

$$E[Y \mid x] = g(x) = E[Y] + \sum_{m=1}^{M} \beta_m f_m(\alpha_m' x) \qquad (1)$$

where $\alpha_m$ are p-dimensional unit projection directions, and $f_m$ are univariate
smooth functions, with zero mean and unit variance, of the projections $\alpha_m' x$,
m = 1, 2, ..., M. The coefficients $\beta_m$ and $\alpha_m$ and the functions $f_m$ are parameters
of the model and are estimated so as to minimise the criterion:

$$L_2 = E\Big[Y - E[Y] - \sum_{m=1}^{M} \beta_m f_m(\alpha_m' X)\Big]^2 \qquad (2)$$

The PPR algorithm estimates expected values by weighted means, where the
observation weights, $w_i$, i = 1, 2, ..., n, are specified by the user. The default is
all weights equal to unity. In this paper we evaluate the use of observation weights
to implement an iterative reweighting scheme for robustification.



2. Robustification of PPR via regression M-estimates 

M-estimation, introduced by Huber (1973), is probably the best known
procedure to reduce the influence of outlying points for the linear regression
model. This works by replacing the usual sum of squared residuals by a sum of
less rapidly increasing functions of the residuals whose general form is:

$$\sum_{i=1}^{n} \rho\big((y_i - \beta' x_i)/s\big) \qquad (3)$$

where s is a robust estimate of scale and $\beta$ is the vector of the regression
coefficients. If the parameter estimates are not to be greatly influenced by a few
data points which are far from the regression plane, then, in general, $\rho(.)$ must
be bounded. The M-estimate of $\beta$ can be found by minimising (3) or, after
taking derivatives, by solving the system of (p+1) equations:

$$\sum_{i=1}^{n} \psi\big((y_i - \beta' x_i)/s\big)\, x_{ij} = 0, \quad j = 0, 1, \ldots, p \qquad (4)$$

where $\psi = \rho'$ and $x_{i0} = 1$, $i = 1, \ldots, n$. The most widely used procedure to
solve (4) is iteratively reweighted least squares, with weights
$w_i = \psi((y_i - \beta' x_i)/s) / ((y_i - \beta' x_i)/s)$. The iterative reweighting procedure is
only guaranteed to converge for convex $\rho$. $\rho$-functions associated with
redescending $\psi$ functions (such as the biweight one described below) are
non-convex, so it is advisable to choose a sufficiently good starting point and
perform a small and fixed number of iterations.

The PPR algorithm proves to be greatly influenced by outlying points when the
observation weights are not purposely specified as robustness weights. This
general non-robustness is essentially due to the adopted criterion of fit and
smoothing procedure. The algorithm estimates the model (1) parameters by
least squares (2) and approximates the smooth functions $f_m$, m = 1, 2, ..., M, for
each selected projection direction, using a non-robust variable span smoother,
called the supersmoother. This smoother is based on a locally linear fit and
adopts least squares as criterion of fit.

Roosen and Hastie (1994) have proposed an alternative PPR algorithm in which
the smooth functions are estimated by smoothing splines. Like the
supersmoother, smoothing splines are not robust to outlying points.

We propose to obtain robust estimates of the PPR model parameters by
minimising the tapered criterion

$$\sum_{i=1}^{n} \rho\Big(\big(y_i - E[y] - \sum_{m=1}^{M} \beta_m f_m(\alpha_m' x_i)\big)/s\Big)$$

The minimisation is carried out, as in the robust linear regression context, by an
iterative reweighting scheme (PPR-1) which consists of the following steps:

0. Fit a PPR model with no robustness weights: starting values for the residuals
   $r_i^{(0)}$ and the scale estimate $s^{(0)}$ are computed. Robustness weights $w_i^{(1)}$
   are then obtained.

1. Step j: fit a PPR model with robustness weights $w_i^{(j)}$.

2. If the relative change of the fitting criterion between iterations j-1 and j is
   smaller than $\epsilon$, where $\epsilon$ is a user-specified threshold (usually 0.005),
   or if j > maxit, where maxit
   is a user-specified maximum number of iterations (3-5), stop; otherwise
   update the residuals and the scale estimate, obtain robustness weights $w_i^{(j+1)}$
   and loop back to 1.



Among the different proposals developed for the weight function, we consider
Tukey's biweight one, defined by:

$$
w(z) =
\begin{cases}
\big(1 - (z/k)^2\big)^2, & |z| < k \\
0, & |z| \ge k
\end{cases}
\qquad (5)
$$

where k is a "tuning" constant. The robustness weights are then defined by
$w_i = w(r_i/s)$, where s is the median absolute residual.

The bisquare weight function (5) has been shown to perform well for robust 
linear regression (Gross, 1977) and has been employed by Cleveland (1979) to 
obtain a robust smoothing technique based on robust locally weighted 
regression. Gross suggests the value k=6 to protect against outliers. The same 
value is considered in Cleveland (1979). 
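
The reweighting idea can be sketched with an ordinary weighted linear fit standing in for the PPR fit. The code below is an illustration under assumptions, not the PPR-1 algorithm itself: a linear model replaces the projection pursuit model, the scale is the median absolute residual, the tuning constant is k = 6, and a small fixed number of iterations is run starting from the unweighted fit. The names `bisquare_weights` and `robust_linear_fit` and the toy data are hypothetical.

```python
import numpy as np

def bisquare_weights(z, k=6.0):
    """Tukey's bisquare weights: (1 - (z/k)^2)^2 inside (-k, k), zero outside."""
    z = np.asarray(z, dtype=float)
    w = (1.0 - (z / k) ** 2) ** 2
    return np.where(np.abs(z) < k, w, 0.0)

def robust_linear_fit(X, y, k=6.0, maxit=5):
    """Iteratively reweighted least squares with bisquare robustness weights."""
    design = np.column_stack([np.ones(len(y)), X])   # intercept + predictors
    w = np.ones(len(y))                              # step 0: no robustness weights
    for _ in range(maxit + 1):
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        resid = y - design @ coef
        scale = np.median(np.abs(resid))             # median absolute residual
        if scale == 0.0:
            break
        w = bisquare_weights(resid / scale, k)
    return coef, w

# Toy data: a clean line with five gross outliers.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.05, size=50)
y[::10] += 20.0
coef, w = robust_linear_fit(x, y)
print(coef)        # close to the intercept 1 and slope 2 of the clean line
print(w[::10])     # the contaminated points receive zero weight
```

Starting from the unweighted fit and stopping after a few iterations mirrors the advice given for redescending weight functions, for which convergence of the reweighting scheme is not guaranteed.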

In the context of non-parametric regression, M-estimate approaches have
also been suggested to give rise to robust versions of kernel and spline smoothing
(see Simonoff, 1996, for updated references) and of additive models (Hastie and
Tibshirani, 1990).

3. Example and conclusions

The non-robustness of PPR is shown by a simple example described by
Friedman and Stuetzle (1981) to stress the capability of PPR to model
interactions between predictors. A sample of 400 observations is generated
according to the model:

$$Y = g(X) + \varepsilon = g(X_1, X_2) + \varepsilon = X_1 X_2 + \varepsilon, \qquad (6)$$

with $(X_1, X_2)$ uniformly distributed in $[-1, 1]^2$.

In case A we assume $\varepsilon \sim N(0, 0.2)$. In this situation model (6) is expressed as a
PPR model with two terms (M = 2), with estimated coefficients
$\hat{\alpha}_1 = (0.7339, -0.6792)$, $\hat{\alpha}_2 = (0.7159, 0.6982)$, $\hat{\beta}_1 = 0.1997$, $\hat{\beta}_2 = 0.2207$,
and estimated smooth functions $\hat{f}_m$, m = 1, 2, showing purely quadratic shapes.
The considered model is then contaminated by taking
$\varepsilon \sim (1 - \delta) N(0, 0.2) + \delta N(0, 2)$, with $\delta = 1$ with probability 0.25 and zero
otherwise (case B). The two projection directions selected by the standard PPR
algorithm are no longer correctly determined ($\hat{\alpha}_1 = (0.4009, -0.9161)$,
$\hat{\alpha}_2 = (0.5772, 0.8166)$). The corresponding estimates for $\beta_m$, m = 1, 2, are
$\hat{\beta}_1 = 0.2307$ and $\hat{\beta}_2 = 0.1878$.
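
The two-term structure found in the uncontaminated case can be understood from an elementary identity, written out here as a worked step (the unit directions below are the exact ones implied by the identity, not estimates taken from the text):

```latex
\[
x_1 x_2
  = \tfrac{1}{4}(x_1 + x_2)^2 - \tfrac{1}{4}(x_1 - x_2)^2
  = \tfrac{1}{2}\,(a_1' x)^2 - \tfrac{1}{2}\,(a_2' x)^2 ,
\qquad
a_1 = \tfrac{1}{\sqrt{2}}\,(1,\ 1)', \quad
a_2 = \tfrac{1}{\sqrt{2}}\,(1,\ -1)' .
\]
```

The product is therefore a sum of two quadratic functions of projections onto directions with components of absolute value $1/\sqrt{2} \approx 0.707$, which is consistent with the directions and quadratic shapes estimated in case A, and which shows how far the directions selected in case B have drifted.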

The estimated functions are not as smooth as in the no-outlier case, being
attracted by outlying points (Figure 1).

Figure 1: $\hat{f}_m$, m = 1, 2, obtained in case A (denoted by 1) and in case B
(denoted by 2) by the standard PPR algorithm.

The iterative reweighting procedure, carried out with the bisquare weights (5)
with tuning constant k = 6, shows satisfactory results, identifying, after j = 5
iterations, the following parameter estimates: $\hat{\alpha}_1 = (0.7476, -0.6641)$,
$\hat{\alpha}_2 = (0.7534, 0.6576)$, $\hat{\beta}_1 = 0.1950$ and $\hat{\beta}_2$ = …. The estimated smooth
functions essentially reproduce the behaviour described for the standard
non-robust PPR algorithm in the no-outlier case (Figure 2). Furthermore, the
results obtained by the robust version of PPR in the outlier case are
very similar to those obtained in the absence of contamination.




Figure 2: $\hat{f}_m$, m = 1, 2, obtained in case A (denoted by 1) by the standard PPR
algorithm and in case B (denoted by 2) by PPR-1.



To evaluate the accuracy of our PPR iterative algorithm in approximating the 
true function, g(X), 100 samples, each of 400 observations, are generated 
according to model (6) either in absence (case A) or in presence (case B) of 
outlying points, with X values generated as described above and kept fixed 
throughout the simulation study. 
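
The data-generating part of this simulation design can be sketched as follows. This is an illustration under assumptions: the predictors are taken uniform on the square [-1, 1] x [-1, 1], the values 0.2 and 2 are read as standard deviations of the clean and contaminating errors, and the names `draw_response` and `mse`, as well as the seeds, are hypothetical. Fitting PPR itself is not shown.

```python
import numpy as np

def draw_response(g, rng, contaminated, sd_clean=0.2, sd_out=2.0, p_out=0.25):
    """Add clean (case A) or contaminated (case B) errors to the true values g."""
    n = len(g)
    eps = rng.normal(0.0, sd_clean, size=n)
    is_outlier = np.zeros(n, dtype=bool)
    if contaminated:
        is_outlier = rng.random(n) < p_out            # delta = 1 with prob. 0.25
        eps = np.where(is_outlier, rng.normal(0.0, sd_out, size=n), eps)
    return g + eps, is_outlier

def mse(g_true, g_hat):
    """Mean squared error between true and estimated regression values."""
    return float(np.mean((g_true - g_hat) ** 2))

# Predictors are drawn once and kept fixed across replications.
X = np.random.default_rng(0).uniform(-1.0, 1.0, size=(400, 2))
g = X[:, 0] * X[:, 1]

rng = np.random.default_rng(1)
replications_B = [draw_response(g, rng, contaminated=True) for _ in range(100)]
y_B, outliers_B = replications_B[0]
y_A, outliers_A = draw_response(g, rng, contaminated=False)
print(outliers_B.mean())    # share of contaminated observations in one sample
print(mse(g, y_B), mse(g, y_A))
```

Averaging `mse(g, g_hat)` over the 100 replications, with `g_hat` the fitted values of each method, would give the mean of the MSEs used to compare the procedures.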

The mean squared error (MSE) is used as a measure of goodness of
approximation

$$MSE = \frac{1}{400} \sum_{i=1}^{400} \big(g(x_i) - \hat{g}(x_i)\big)^2 \qquad (7)$$

with $\hat{g}(x_i)$ denoting the estimate of $g(x_i)$, i = 1, 2, ..., 400.

In both cases (A and B) the performances of PPR and PPR-1 are compared on
the basis of the mean of the MSEs (MMSE), averaged over the 100 replications,
and of its standard error.

As shown by the simulation results, reported in Table 1, our iteratively
reweighted algorithm outperforms the standard one in the contaminated
situation, while both procedures perform very similarly, in terms of MMSE, in the
absence of outlying points. In both cases PPR-1 demonstrates better stability
than PPR.



Table 1: Simulation results on 100 replications

                  Case A                Case B
               PPR      PPR-1        PPR      PPR-1
MMSE          0.0027    0.0024      0.0261    0.0040
se(MMSE)      0.0003    0.0001      0.0014    0.0002



The simulation results stress both the robustness of our proposed algorithm,
whose satisfactory approximation capability is only very slightly influenced by the
presence of points which are far from the regression surface, and the general
non-robustness of standard PPR.

In an exploratory stage it is therefore advisable to compare the robust and
non-robust versions of PPR to detect the presence of outlying points. If both
procedures perform similarly, the original PPR algorithm is, of course, to
be preferred.
Abstract: In previous papers we proposed to apply multivariate statistical
methodologies, like Multidimensional Scaling (MDS) and Seriation, to the stock
location assignment problem of a warehouse, often solved by considering the
Cube per Order Index (COI). In this paper we compare the results obtained with MDS,
Seriation, a COI based method and the Maximum Path criterion, considering the
data of a whole year of a Sicilian supermarket chain warehouse. The comparison
is based on the simulated times to satisfy a sample of real orders.

Key words: stock location assignment, MDS, seriation, minimum spanning tree,
maximum path criterion, COI, simulations.



1. Introduction 

The problem of the best stock location assignment in a warehouse plays a
fundamental role in optimising picking activities. Among the different
components of time that make up this activity (Mineo, Plaia, 1997 a), the time to
reach the first picking position from the Input/Output (I/O) position (and return)
and to reach all the picking positions is the most important one:
an improvement in stock location assignment can significantly reduce this
component of time, and hence warehouse costs.

In Mineo & Plaia (1997 a, c), multivariate statistical methodologies were
applied for the first time in order to optimise stock location assignment; the
results obtained with MDS and Seriation show the usefulness of such
methodologies, even when applied to the data of a warehouse of a Sicilian super- and
hypermarket chain where a sub-optimal solution had already been reached thanks to
human experience and know-how. In Mineo & Plaia (1997 b) it was shown
that the MDS approach is very competitive with, and often better than, the methods
usually used to cope with such a problem (generally Operational Research
techniques, as for the travelling salesman problem): in particular, the MDS approach
was compared with a method based on the Cube per Order Index, COI
(Caron, Marchet, 1994), which is a very popular method.

* Authors' names are listed in alphabetic order.

In this paper we introduce another multivariate statistical technique to
solve the considered problem, i.e. an extension of the Maximum Path criterion
(Scippacercola, 1997), based on the determination of the Minimum Spanning
Tree (Gower, Ross, 1969), and we compare the results obtained by applying this
criterion with those of the MDS, Seriation and COI approaches. An important
feature of this paper, unlike our earlier papers (Mineo, Plaia, 1997 a, b
and c), is that the data of the considered warehouse cover a whole year.
In this way we can introduce a new variable to deal appropriately with
seasonal goods and, by comparing the results by means of simulations
based on real picking order lists, we can select the best of the proposed
methods for the warehouse under consideration.



2. The compared methods 

In order to apply the MDS, seriation and maximum path criterion, a dissimilarity
matrix has been considered: this matrix has been computed from the
Euclidean distances among the classes on the variables considered, as will be
discussed in the next section. In this paper we use the non-metric MDS
algorithm (Schiffman, Reynolds, Young, 1981) implemented in the
STATISTICA® package, which consists in the minimisation of the raw-stress
function:

$$\text{raw-stress} = \sum_{i,j} \bigl[ d_{ij} - f(\delta_{ij}) \bigr]^2 \qquad (1)$$

where $d_{ij}$ are the reproduced distances, given the respective number of
dimensions, and $\delta_{ij}$ are the input data, i.e. the observed distances. The
expression $f(\delta_{ij})$ indicates a non-metric, monotone transformation of the
observed distances. Thus, the program will attempt to reproduce the general
rank-ordering of distances between the objects in the analysis: in this way we can
find an arrangement for placing the goods in the warehouse. For the reasons for
choosing non-metric MDS see Mineo & Plaia (1997 a).
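
To make the raw-stress criterion concrete, the sketch below evaluates it for a given configuration. It is only an illustration under stated assumptions and not the STATISTICA implementation: the monotone transformation f is obtained by a least-squares monotone (isotonic) regression of the reproduced distances on the rank order of the observed ones, computed with the pool-adjacent-violators algorithm, and ties among observed distances are not treated specially. Function names are hypothetical.

```python
def monotone_regression(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while a block mean is smaller than the one before it.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def raw_stress(observed, reproduced):
    """Sum of squared differences between reproduced distances and their
    monotone transformation; both lists run over the same pairs (i, j)."""
    order = sorted(range(len(observed)), key=lambda idx: observed[idx])
    d_sorted = [reproduced[idx] for idx in order]
    disparities = monotone_regression(d_sorted)
    return sum((d - f) ** 2 for d, f in zip(d_sorted, disparities))


# Reproduced distances that violate the observed rank order in one place.
stress = raw_stress([1, 2, 3, 4], [1.0, 3.0, 2.0, 4.0])
```

A configuration whose distances are already in the same rank order as the observed dissimilarities has zero raw stress, which is why the method only tries to reproduce the rank-ordering.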

To solve the considered problem we can also use a seriation algorithm, such as
that proposed by Wright (1985): in fact, by considering the way the warehouse
aisles are crossed, i.e. by following a traversal policy (Caron, Marchet, 1994),
we can think of our problem as placing objects along a continuum; so we have to
minimise the loss function:

$$LF(x) = \sum_{i<j} \bigl( \delta_{ij} - |x_i - x_j| \bigr)^2 \qquad (2)$$

over $x = (x_1, x_2, \ldots, x_n)$, where $x_i$ is the coordinate of object i on the axis and $\delta_{ij}$
has the usual meaning.
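
A minimal sketch of how a candidate placement along the axis can be scored is given below. It assumes the squared-discrepancy form of the seriation loss (the sum, over all pairs, of the squared difference between the observed dissimilarity and the distance between the two coordinates); the minimisation itself is not shown, and the function name is hypothetical.

```python
def seriation_loss(delta, x):
    """Loss of placing object i at coordinate x[i], given dissimilarities delta[i][j]."""
    n = len(x)
    return sum((delta[i][j] - abs(x[i] - x[j])) ** 2
               for i in range(n) for j in range(i + 1, n))


# Dissimilarities generated by three points lying at 0, 1 and 3 on a line.
delta = [[0, 1, 3],
         [1, 0, 2],
         [3, 2, 0]]
perfect = seriation_loss(delta, [0, 1, 3])    # coordinates reproduce delta exactly
perturbed = seriation_loss(delta, [0, 2, 3])  # middle object moved by one unit
```

Only the differences between coordinates enter the loss, so a solution is defined up to translation and reflection of the axis.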

Finally, the maximum path criterion gives another layout for
the k-dimensional points. This criterion consists in finding the path M of maximum
length in the Minimum Spanning Tree (MST) (Ross 1969). In this paper we
follow the approach proposed by Scippacercola (1997), which consists in placing the
points $p_h$ lying on side edges of M inside the path, at a distance from the
extreme vertices proportional to the ratio in the MST, that is, by using the
relation:



$$g(p_0, p_h) = \frac{d(p_0, p_h)}{d(p_h, p_z)} \qquad (3)$$



where $p_0$ and $p_z$ represent the extreme points of the Maximum Path in the MST.
Besides these statistical methods, we also considered a method based on the COI,
defined, for each good, as the ratio between the volume assigned on the shelf and
the picking rate: if the assigned volumes are equal, increasing COI values
correspond to decreasing picking rates.
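
The first step of the maximum path criterion described above can be sketched as follows: the MST of the dissimilarity matrix is built with Prim's algorithm, and the path of maximum length in the tree is then found with two passes of a tree traversal. This is a hedged illustration only; the placement of the points lying on side edges, as in Scippacercola (1997), is not implemented, and all function names are hypothetical.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a full symmetric dissimilarity matrix.

    Returns the tree as an adjacency dict {node: [(neighbour, weight), ...]}.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known link of each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    tree = {i: [] for i in range(n)}
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            tree[u].append((parent[u], dist[u][parent[u]]))
            tree[parent[u]].append((u, dist[u][parent[u]]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return tree


def farthest_from(tree, start):
    """Node farthest from `start` along tree edges, its distance, and the parent map."""
    reach = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, w in tree[u]:
            if v not in reach:
                reach[v] = reach[u] + w
                parent[v] = u
                stack.append(v)
    end = max(reach, key=reach.get)
    return end, reach[end], parent


def maximum_path(dist):
    """Longest path in the MST: vertices in order and total length."""
    tree = minimum_spanning_tree(dist)
    a, _, _ = farthest_from(tree, 0)           # one extreme of the longest path
    b, length, parent = farthest_from(tree, a)  # the other extreme
    path = [b]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1], length


# Four objects whose dissimilarities come from positions 0, 1, 3, 6 on a line.
coords = [0.0, 1.0, 3.0, 6.0]
dist = [[abs(a - b) for b in coords] for a in coords]
path, length = maximum_path(dist)
```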



3. The data set 

To examine the problem described in the previous section in more depth, the
data of the warehouse of a Sicilian chain of super- and hypermarkets have been
considered. About 1500 different goods are stocked in the warehouse; these can
be grouped by affinity into about 25 classes, as
suggested by the Organised Distribution Network to which the Sicilian chain
belongs. In our earlier papers, considering just the number of items¹ per good
picked in the period from 1.7.1996 to 31.12.1996, we had already noticed a
lack of homogeneity inside some classes, so 44 smaller classes were obtained by
separating out some goods whose behaviour, according to the considered variable,
was quite different from that of the other goods in the class.

In the present paper we consider these classes and, besides the variables already
considered, namely:

1. the number of picked items per good
2. the average volume of the items in the class
3. item fragility
4.-6. the average number of items per good per order, for each of the three different
kinds of supermarket in the network
7. the number of goods in the class;

a new variable, coded by the authors, has been added. In fact, as the period of
observation is now a whole year (1.7.1996 to 30.6.1997), a new variable can be
considered:

8. the seasonal component.

¹ Item: smallest pickable quantity of a good.
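
The dissimilarity matrix among classes used by the compared methods can be sketched as below. The numbers are invented for illustration, and the column standardisation is an assumption of this sketch (the text only states that Euclidean distances are computed on the variables listed above); function names are hypothetical.

```python
import math


def standardise(rows):
    """Scale each column to zero mean and unit standard deviation.

    Constant columns are left centred and unscaled. Standardising before
    taking distances is an assumption, made because the variables are
    measured in different units.
    """
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def euclidean_matrix(rows):
    """Full symmetric matrix of Euclidean distances between the rows."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


# Hypothetical classes described by (picked items, average volume, fragility).
classes = [[1200.0, 2.0, 0.0],
           [300.0, 4.5, 1.0],
           [800.0, 3.0, 0.0]]
dissimilarity = euclidean_matrix(standardise(classes))
```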

In order to compare the effectiveness of the proposed methodologies for the
stock location problem, a simulator has been developed in the C language.

In particular, the four proposed solutions are compared with the current
allocation by drawing a sample of real orders received by the warehouse from
each super/hypermarket in the chain. The time to satisfy each order has been
computed.

It is important to highlight that the current allocation is not a mere starting
solution, but the result of years of experience and improvement.

By computing the distances travelled to satisfy the sample of real orders, we can
choose the best approach to this problem for our specific warehouse.



4. Results 

The simulator has been used to compute the distances travelled in order to
satisfy a sample of 15 orders for each of the 37 stores of the supermarket chain.
The 37 stores are the only retail stores in the chain.

The sample of orders covers the same period of time as the dataset described above.

In Figures 1a and 1b the following solutions are compared:

1. the current stock location assignment

2. the COI-based solution

3. the non-metric MDS solution

4. the Maximum Path in the Minimum Spanning Tree solution

5. the Seriation solution.

The current solution is evidently not competitive, being worse than the
other four. The COI based solution also seems worse than the remaining ones,
although the difference is not evident in every case.

Figure 1a: Average distances travelled (m) by the operator to satisfy a sample
of orders for each of the first 18 stores of the chain with the five techniques

Figure 1b: Average distances travelled (m) by the operator to satisfy a sample
of orders for each of the remaining 19 stores of the chain with the five techniques

The other three solutions seem to be comparable; the MST solution is probably
the best one, although no tests have been carried out.



5. Conclusion 

Different statistical methodologies have been compared in order to improve the
current stock location assignment. The results show that all the solutions
perform better than the current one and, moreover, that the statistical methodologies
seem better than the most popular method for dealing with this problem (the COI
index).

The new variable (seasonal component) does not change the behaviour of the
MDS and Seriation solutions relative to each other (see Mineo & Plaia 1997c).

The newly tested methodology, based on the modification of the Maximum Path
in the Minimum Spanning Tree proposed by Scippacercola (1997), proves to be
very interesting and easily adaptable to the stock location assignment
problem.

In conclusion, all the proposed statistical methodologies seem to solve the stock
location assignment problem properly, giving even better results than
COI.

Complications could arise if we considered each good, instead of each class of
goods, as a unit (in this case the COI method can be easily applied). The
distance matrix would be difficult to handle, and for this reason too we
preferred to deal with classes of goods. In small warehouses
the distance matrix can be computed for the single goods, so the methods can be
used to order the goods rather than the classes; in large warehouses, however, the
management prefers to keep goods belonging to the same "class" side by side,
so the application of the proposed methods to the distance matrix of the
classes is both justified and competitive.

In conclusion, the tested multivariate techniques can be easily and effectively
applied in any case to cope with the stock location assignment problem.
Abstract: In this paper a class of models for conditionally Gaussian space-time
processes is proposed in a state-space setup. These models have two main
purposes: describing and forecasting the temporal evolution of the observed
phenomena. Observed processes are assumed to be the sum of unobservable
components, such as a time varying level, a periodic component, a stationary
autoregressive component and a measurement error. We give sufficient conditions
for model identifiability and show how Bayesian analysis can be performed via
Gibbs sampling. We present the results of an application to some simulated
spatial time series.

Key words: state-space models; spatial correlation; signal to noise ratio; Gibbs 
sampler. 



1. Introduction 

Environmental data are usually collected over space and time. A spatio-temporal
analysis can provide benefits not possible either from a spatial-only approach or
from the modelling of site specific time series. When individual time series are
analyzed separately, for instance, information about spatial interaction or about
common features among different sites is lost. On the other hand, a merely spatial
analysis, ignoring the underlying dynamics of the observed phenomena, might give
misleading results. This is why it is important to analyze both spatial and temporal
features of data simultaneously and as accurately as possible.

In Tonellato (1997a) we defined a new class of models for spatial time series,
which extends the well-known Bayesian dynamic linear models (West and
Harrison 1989) to the domain of the analysis of space-time stochastic processes. We
believe that such models can be a very useful tool in environmental
monitoring. In fact, they can give both a statistical description of the evolution
of the observed processes and reliable temporal forecasts, taking into account the
spatial interaction among time series collected over a monitoring network. In
this paper we focus our attention on the estimation of unknown parameters using
the Gibbs sampler. In section 2 we give a brief description of the models we are
dealing with and in section 3 we present the results obtained in some simulations.

‘This research has been partially supported by M.U.R.S.T. (Ministero dell’ Universita e della 
Ricerca Scientifica e Tecnologica). 
2. The model

Suppose that observations come from an incomplete sample collected on a fixed
region, $\mathcal{S} = \{s_i : s_i \in \mathcal{D} \subset \mathbb{R}^2, \; i = 1, \ldots, n\}$, independent of t. In such a
context, a spatial time series is defined as an n-variate time series $y_t$, $t \in \mathbb{Z}$. For any
fixed t, the n-dimensional vector $y_t$ is a partial realisation of a spatial stochastic
process defined on $\mathcal{D}$ and observed on $\mathcal{S}$. The class of models proposed here
assumes that $y_t$ can be represented in terms of unobservable components, each
describing salient features of the dynamics of the process. In their most general
form, such models give the following representation of the observed process:

$$y_t = \mu_t + \gamma_t + x_t + \varepsilon_t, \quad \forall t \in \mathbb{Z}, \qquad (1)$$

where $\mu_t$ and $\gamma_t$ denote respectively a temporal trend and a temporally
non-stationary periodic component, such as a seasonal or daily periodicity. For any
fixed t, $\mu_t$ and $\gamma_t$ might be modelled as spatial processes, or as scalar valued
components common to the n sites, or even as n-dimensional vectors of i.i.d.
random variables; $x_t$ is a temporally stationary autoregressive spatio-temporal
component; $\varepsilon_t$ denotes an erratic term, distributed as a Gaussian white noise
$GWN(0, \sigma_\varepsilon^2 I_n)$ ($I_n$ denotes the identity matrix of order n). A state-space
representation of these models is possible and is defined as:

$$y_t = F \zeta_t + \varepsilon_t, \quad \varepsilon_t \sim GWN(0, \sigma_\varepsilon^2 I_n) \qquad (2)$$

$$\zeta_t = G \zeta_{t-1} + B u_t, \quad u_t \sim GWN(0, \Sigma_u), \qquad (3)$$

where $F \in \mathbb{R}^{n \times k}$, $\zeta_t \in \mathbb{R}^{k}$, $G \in \mathbb{R}^{k \times k}$, $u_t \in \mathbb{R}^{k'}$ ($k' \le k$), $B \in \mathbb{R}^{k \times k'}$.
Moreover, we shall assume that $\varepsilon_t \perp u_{t'}$, $u_t \perp \zeta_{t'}$, ... Generally,
$\sigma_\varepsilon^2$ and $\Sigma_u$ are unknown, whereas G is only partially known, so estimation
problems arise. In a previous work (Tonellato 1997a) we gave sufficient conditions
for model identifiability (O'Hagan 1994) in order to guarantee likelihood
informativeness, avoid overparameterisation problems and reduce the dependency of
the predictive distribution on the parameter prior. In the same work we considered
the inclusion of explanatory variables.



3. Results of a simulation study 

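We simulated some spatial time series from the following model with n = 5 and
$1 \le t \le 500$:

$$y_t = \mu_t 1_5 + x_t + \varepsilon_t, \quad \varepsilon_t \sim GWN(0, \sigma_\varepsilon^2 I_n) \qquad (4)$$

(the scalar time varying level $\mu_t$ is common to the n sites; $1_n$ indicates the
n-dimensional unit vector), with

$$\mu_t = \mu_{t-1} + \eta_t, \quad \eta_t \sim GWN(0, \sigma_\eta^2) \qquad (5)$$

$$x_t = \Phi x_{t-1} + \omega_t, \quad \omega_t \sim GWN(0, \Sigma_\omega R(\theta) \Sigma_\omega) \qquad (6)$$
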
Table 1: Parameter values and estimated marginal posterior percentiles.

Series  Parameter                True value   $q_{0.05}$   $q_{0.5}$   $q_{0.95}$

M1      $\phi_1$                    0.5         0.337       0.510       0.656
        $\phi_2$                   -0.7        -0.786      -0.685      -0.570
        $\phi_3$                    0.2        -0.210      -0.029       0.134
        $\phi_4$                   -0.4        -0.613      -0.433      -0.367
        $\phi_5$                    0.7         0.561       0.780       0.932
        $\sigma_\eta^2$             0.01        0.007       0.010       0.014
        $\sigma_\varepsilon^2$      0.02        0.008       0.023       0.031
        $\sigma_{\omega,1}$         1.41        1.392       1.477       1.565
        $\sigma_{\omega,2}$         1.41        1.304       1.416       1.509
        $\sigma_{\omega,3}$         1.41        1.261       1.354       1.435
        $\sigma_{\omega,4}$         2.00        2.116       2.258       2.445
        $\sigma_{\omega,5}$         2.00        1.938       2.118       2.260
        $\alpha$                    0.95        0.872       0.918       0.979
        $\kappa$                    0.1         0.084       0.106       0.129

M2      $\phi_1$                    0.5         0.238       0.408       0.571
        $\phi_2$                   -0.7        -0.876      -0.778      -0.680
        $\phi_3$                    0.2        -0.099       0.017       0.154
        $\phi_4$                   -0.4        -0.568      -0.398      -0.228
        $\phi_5$                    0.7         0.558       0.650       0.760
        $\sigma_\eta^2$             0.01        0.008       0.011       0.015
        $\sigma_\varepsilon^2$      0.04        0.014       0.032       0.046
        $\sigma_{\omega,1}$         1.41        1.205       1.302       1.380
        $\sigma_{\omega,2}$         1.41        1.370       1.481       1.582
        $\sigma_{\omega,3}$         1.41        1.268       1.344       1.443
        $\sigma_{\omega,4}$         2.00        1.742       1.904       2.092
        $\sigma_{\omega,5}$         2.00        1.565       1.709       1.852
        $\alpha$                    0.95        0.836       0.893       0.936
        $\kappa$                    0.1         0.073       0.087       0.104

M3      $\phi_1$                    0.5         0.151       0.380       0.586
        $\phi_2$                   -0.7        -0.859      -0.753      -0.636
        $\phi_3$                    0.2        -0.094       0.016       0.200
        $\phi_4$                   -0.4        -0.612      -0.418      -0.153
        $\phi_5$                    0.7         0.485       0.655       0.784
        $\sigma_\eta^2$             0.01        0.006       0.008       0.012
        $\sigma_\varepsilon^2$      0.08        0.055       0.074       0.088
        $\sigma_{\omega,1}$         1.41        1.360       1.492       1.612
        $\sigma_{\omega,2}$         1.41        1.344       1.450       1.595
        $\sigma_{\omega,3}$         1.41        1.204       1.294       1.368
        $\sigma_{\omega,4}$         2.00        1.837       2.038       2.277
        $\sigma_{\omega,5}$         2.00        1.841       2.046       2.210
        $\alpha$                    0.95        0.753       0.829       0.914
        $\kappa$                    0.1         0.084       0.092       0.109

Figure 1: Plots of the 5-th, 50-th and 95-th percentiles of simulated values of $\alpha$
and $\kappa$ versus iterations of the Gibbs sampler (dotted lines) and true parameter
values (bold lines).

where $\Phi = \mathrm{diag}[\phi_1, \phi_2, \phi_3, \phi_4, \phi_5]$ is a diagonal matrix containing the unknown
autoregressive coefficients. For any fixed t, $\omega_t$ has been defined as a Gaussian
spatial process. The matrices $\Sigma_\omega = \mathrm{diag}[\sigma_{\omega,1}, \sigma_{\omega,2}, \sigma_{\omega,3}, \sigma_{\omega,4}, \sigma_{\omega,5}]$ and $R(\theta)$
contain the site specific standard deviations and the spatial correlation coefficients
of $\omega_t$ respectively. The spatial correlation function of $\omega_t$ has been specified as an
isotropic exponential one (Cressie 1993), i.e.

$$\mathrm{Corr}(\omega_t(s_i), \omega_t(s_j)) = \alpha \exp\{-\kappa d_{ij}\}, \quad 0 < \alpha < 1, \; \kappa > 0, \; \theta = [\alpha \;\; \kappa]'$$

where $d_{ij}$ denotes the Euclidean distance between sites $s_i$ and $s_j$,
i, j = 1, 2, ..., 5. It is easy to verify that the model defined by eqs. (4)-(6) has a
state-space representation like the one introduced in eqs. (2)-(3), with

$$F = [\,1_5 \;\; I_5\,], \quad \zeta_t = \begin{bmatrix} \mu_t \\ x_t \end{bmatrix}, \quad G = \begin{bmatrix} 1 & 0' \\ 0 & \Phi \end{bmatrix}, \quad B = I_{n+1}, \quad \text{and} \quad u_t = \begin{bmatrix} \eta_t \\ \omega_t \end{bmatrix}.$$

The unknown parameters are $\phi_i$, $\sigma_{\omega,i}$, i = 1, ..., 5, $\sigma_\eta^2$, $\sigma_\varepsilon^2$, $\alpha$ and $\kappa$.

It is well known in the dynamical systems literature that measurement error
variability negatively affects both filtering (i.e. the estimation of the unobservable
state variable) and parameter estimation. In our simulation study we want to
assess how much measurement error variance can affect our inference.
Therefore, we generated three independent series, denoted by M1, M2 and M3, from
the model defined by eqs. (4)-(6) and characterised by different values of $\sigma_\varepsilon^2$,
which has been fixed equal to 0.02, 0.04 and 0.08 respectively. The values of
the remaining parameters have been held constant and are reported
in Table 1. An indicator of the weight of measurement error variability in the
models we are dealing with is given by the ratio between $\sigma_\varepsilon^2$ and the variance of
the stationary signal (we call this quantity signal to noise ratio and denote it by
$S_{M1}$, $S_{M2}$ and $S_{M3}$). Clearly, due to the spatial heteroscedasticity of $x_t$, this
ratio is not constant over the different sites. It is easy to verify that in the three
series we generated we have: $0.02 < S_{M1} < 0.0672$, $0.040 < S_{M2} < 0.134$ and
$0.089 < S_{M3} < 0.269$.

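
A series from the model of eqs. (4)-(6) can be generated along the following lines. This is a minimal sketch under stated assumptions, not the code used for the study: the site coordinates are hypothetical (they are not given in the text), the initial state is set to zero, the correlation matrix is given a unit diagonal, and the parameter values are the true values reported for M1 in Table 1. All function names are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def exponential_correlation(sites, alpha, kappa):
    """Isotropic exponential correlation alpha * exp(-kappa * d_ij), unit diagonal."""
    n = len(sites)
    return [[1.0 if i == j
             else alpha * math.exp(-kappa * math.dist(sites[i], sites[j]))
             for j in range(n)] for i in range(n)]


def simulate(sites, phi, sigma_omega, alpha, kappa, var_eta, var_eps, t_max, rng):
    """Simulate y_t = mu_t 1 + x_t + eps_t for t = 1, ..., t_max."""
    n = len(sites)
    corr = exponential_correlation(sites, alpha, kappa)
    cov = [[sigma_omega[i] * corr[i][j] * sigma_omega[j] for j in range(n)]
           for i in range(n)]
    low = cholesky(cov)
    mu, x, series = 0.0, [0.0] * n, []   # zero initial state: an assumption
    for _ in range(t_max):
        mu += rng.gauss(0.0, math.sqrt(var_eta))            # random-walk level
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        omega = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        x = [phi[i] * x[i] + omega[i] for i in range(n)]    # site-wise AR(1)
        series.append([mu + x[i] + rng.gauss(0.0, math.sqrt(var_eps))
                       for i in range(n)])
    return series


sites = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 5.0)]  # hypothetical
series = simulate(sites,
                  phi=[0.5, -0.7, 0.2, -0.4, 0.7],
                  sigma_omega=[1.41, 1.41, 1.41, 2.0, 2.0],
                  alpha=0.95, kappa=0.1,
                  var_eta=0.01, var_eps=0.02,
                  t_max=500, rng=random.Random(1))
```

Changing `var_eps` to 0.04 or 0.08 gives series of the kind denoted M2 and M3.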
The unknown parameters are assumed to be independent a priori. The prior
distribution elicited on $\sigma_\varepsilon^2$ is an IG(0.5, 100) for M1 and M2 (IG indicates an
Inverted Gamma distribution) and an IG(1.2, 12) for M3. The priors elicited on
the remaining parameters are the following:

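$\zeta_0 \sim N(0, 100\, I_6)$,   $\sigma_\eta^2 \sim \ldots(\ldots, 100)$,

$\alpha \sim G(0, 1)$,   $\kappa \sim G(0, 1)$,

$\phi_i \sim N(0, 0.5)\, 1_{(-1,1)}$,   $\sigma_{\omega,i}^2 \sim IG(2.5, 1)$,   i = 1, ..., 5.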

The normal prior for $\phi_i$ is truncated on the interval (-1, 1) in order to guarantee
the stationarity of $x_t$; the prior distributions of $\sigma_\eta^2$ and $\sigma_\varepsilon^2$ seem quite concentrated
on small values. Actually, the former is appropriate for a smoothly time varying
level, which is usually the case for environmental time series, whereas the latter
allows for high values of the signal to noise ratio. Suppose our interest centres
on the estimation of all the unknown parameters appearing in the model with the
exception of the state variables. The whole parameter vector has dimension 3020.
So we have to cope with a very complex parameterisation and marginalise the
joint posterior with respect to a 3006-dimensional nuisance parameter. In spite
of the high number of parameters, it can be shown that the model is identifiable
(Tonellato 1997a) if $\Phi$ is nonsingular $\pi$-almost surely ($\pi$ denotes the prior joint
distribution of the parameters). Clearly, the complexity of the model does not
allow us to compute analytically the posterior distribution of the parameters of
interest, so we used an approximate version of the Gibbs sampler (Gelfand and
Smith 1990). For the definition of the full conditional distributions required by
the Gibbs sampler refer to Tonellato (1997a). Here we want to point out that, in
order to sample values of $\alpha$ and $\kappa$ from their full conditionals, we used an
approximate algorithm, the griddy Gibbs sampler (Ritter and Tanner 1992). It is worth
noting that the same computational tools would work with different
specifications of the spatial correlation function of $\omega_t$. For each series we generated 100
independent chains with starting values sampled from the prior distribution of the
parameters. The estimated posterior marginal 5-th, 50-th and 95-th percentiles
are shown in Table 1. To give an idea of the behaviour of the Gibbs sampler, in
Figure 1 we show the plots of the percentiles of the simulated values of $\alpha$ and
$\kappa$ versus iterations of the Gibbs sampler for M1. We consider these parameters
explicitly because they are probably the most difficult to estimate, since they
appear in the spatial correlation function of the noise of the unobservable
space-time autoregressive process $x_t$. The marginal posterior 95% confidence intervals
of the autoregressive coefficients contain the true value of each parameter, with
the exception of $\phi_3$. In fact, $\phi_3$ is quite close to the upper bound of the confidence
interval. The behaviour of the marginal posterior percentiles of $\sigma_{\omega,i}$, i = 1, ..., 5,
is particularly interesting. The posterior 95% confidence intervals are sensitive
to the spatial heteroscedasticity of $x_t$, although they do not always cover the
true values of the parameters. In any case, such distortions are quite small and
they would probably disappear for larger sample sizes. As far as $\sigma_\eta^2$ and $\sigma_\varepsilon^2$ are
concerned, these parameters are always covered with high precision by the
posterior confidence intervals. Finally, it is worth noting that the analysis is quite
robust to changes of the signal to noise ratio, which means that these models can
capture important aspects of the spatio-temporal dynamics of the process even
when measurement errors are not negligible.
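
The idea behind a griddy Gibbs draw can be sketched as follows: the full conditional is evaluated, up to a constant, on a grid of values, its distribution function is approximated numerically, and a value is drawn by inverting that approximation. The code is only an illustration under stated assumptions: the target below is a toy log-density standing in for the full conditional of a parameter such as $\alpha$ (the actual full conditionals are given in Tonellato 1997a and are not reproduced here), the CDF is built with the trapezoidal rule and inverted by linear interpolation, and the function names are hypothetical.

```python
import bisect
import math
import random


def griddy_draw(log_density, grid, rng):
    """One draw from a density known up to a constant, approximated on `grid`."""
    logs = [log_density(g) for g in grid]
    top = max(logs)                                  # guard against underflow
    dens = [math.exp(v - top) for v in logs]
    cdf = [0.0]                                      # unnormalised trapezoidal CDF
    for k in range(1, len(grid)):
        cdf.append(cdf[-1] + 0.5 * (dens[k] + dens[k - 1]) * (grid[k] - grid[k - 1]))
    u = rng.random() * cdf[-1]
    k = min(max(bisect.bisect_right(cdf, u), 1), len(grid) - 1)
    span = cdf[k] - cdf[k - 1]
    frac = (u - cdf[k - 1]) / span if span > 0 else 0.0
    return grid[k - 1] + frac * (grid[k] - grid[k - 1])


def toy_log_density(a):
    """Toy unnormalised log-density concentrated around 0.3 (illustration only)."""
    return -0.5 * ((a - 0.3) / 0.05) ** 2


rng = random.Random(0)
grid = [k / 200 for k in range(201)]                 # grid on the interval (0, 1)
draws = [griddy_draw(toy_log_density, grid, rng) for _ in range(2000)]
```

Within a Gibbs sampler such a draw would replace exact sampling for those parameters whose full conditionals have no standard form, the remaining parameters being updated from their own conditionals as usual.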



4. Discussion 

We have illustrated some aspects of a new class of models for spatial time
series with particular emphasis on parameter estimation. It can be shown that such
models are identifiable in spite of the high number of parameters. The results
obtained in some simulations support the idea that reliable information about the
unknown parameters can be achieved through the posterior analysis. In
particular, the dynamics of an unobservable space-time autoregressive process can be
analysed at a good level of precision.

The interested reader can find an application to real data in Tonellato (1997b).
Here, we want to recall some interesting aspects of that work. We analysed
atmospheric carbon monoxide concentrations by defining a nonstationary space-time
model and assuming a lognormal distribution of the data conditionally on some
unknown parameters. The site specific temporal forecasts proved quite reliable
within a forecast horizon of 24 hours. Moreover, a retrospective analysis of the
unobservable state gave a statistical explanation of the temporal
heteroscedasticity of the observed process at a particular site.

Possible extensions to discrete data and to spatial prediction problems are
under way.
Abstract: This paper deals with the problem of evaluating dissimilarities between
trajectories in a three-way longitudinal data set (a set of multiple time series). The
dissimilarity between trajectories is defined as a conic combination of the
dissimilarities between trends, velocities and accelerations of the pair of trajectories.
The coefficients of the linear combination are estimated by maximizing its variance. The
proposed methodology is applied to a real data set to classify trajectories of Italian
regions by their employment dynamics.

Key words: Trajectories, Three-way data, Trend, Velocity, Acceleration,
Dissimilarities.

1. Position of the problem

Let $X_1, \ldots, X_k$ be k quantitative variables observed on n units (objects) at u
consecutive time points. The observed data can be arranged into a three-way
longitudinal data set $X = \{x_{ijt} : i \in I, \; j \in J, \; t \in T\}$,
where $x_{ijt}$ is the value of the j-th variable collected on the i-th object at time t;
$I = \{1, \ldots, n\}$, $J = \{1, \ldots, k\}$ and $T = \{1, \ldots, u\}$ are the sets of indices pertaining to objects,
variables and time points, respectively. The observed objects can be represented as
points of a vectorial space equipped with a distance, i.e. a real function $\delta$ on the set
$Y = \{y_{i.t} : i \in I, \; t \in T\}$ from $Y \times Y$ to $\mathbb{R}^+$ such that:

$$\delta_t(i,l) = \delta_t(l,i) \ge 0, \quad \delta_t(i,l) + \delta_t(h,l) \ge \delta_t(i,h), \quad \forall\, i, l, h \in I,$$

where $\delta_t(i,l)$ indicates the distance between objects i and l of I at time t. Let
$\mathbb{R}^{k+1}$, equipped with a distance d, be the metric space spanning the k variables and time.

For each object i, $Y(i) = \{y_{i.t} : t \in T\}$ describes the time trajectory of the i-th
object according to the k examined variables. The trajectory $Y(i)$ is geometrically
represented by u - 1 segments connecting the u points $y_{i.t}$ of $\mathbb{R}^{k+1}$ (Figure 1).

This paper deals with the problem of measuring the dissimilarity between trajectories in
a three-way longitudinal data set.

The point subscript means that the vector has elements taken according to the relative index
substituted by the point.
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Figure 1: Two time trajectories in 




The problem of evaluating a dissimilarity between $Y(i)$ and $Y(l)$ was first introduced by
Carlier (1986), who considered as a measure the result of two weighted components
combined using subjective weights. In this paper the dissimilarity between time
trajectories is defined as a linear combination of distances between trends, velocities
and accelerations of a pair of trajectories. The coefficients of the combination are
obtained by maximizing its variance.

The (n×n) matrix of dissimilarities between time trajectories can be used for
subsequent multivariate analyses. For example, it has been used to classify trajectories
into homogeneous sets with a clustering algorithm (Carlier, 1986), or to represent
trajectories as points of a Euclidean space of reduced dimensions with an MDS
technique.

Before formalizing the notion of dissimilarity between time trajectories, it is
useful to discuss the properties that such measures should satisfy.

Remark 1: a dissimilarity between time trajectories $Y(i)$ and $Y(l)$ takes into account
differences between trends of objects i and l, and between velocities and accelerations of
$Y(i)$ and $Y(l)$. Velocity and acceleration are two characteristics of a trajectory that strongly
describe the changes of objects over time. For example, in $\mathbb{R}^2$ the velocity of each
segment of the trajectory is the slope of the straight line passing through it; if velocity
is negative (positive) the slope will be negative (positive) and the angle made by each
segment of the trajectory with the positive direction of the t-axis will be obtuse
(acute). Geometrically, the acceleration of each pair of segments of a trajectory represents
their convexity or concavity. If acceleration is positive (negative) the trajectory of the
two segments is convex (concave).

Remark 2: a dissimilarity between time trajectories degenerates into a classical
dissimilarity between two objects when trajectories are reduced to two points (i.e.,
u = 1).

Remark 3: a dissimilarity between time trajectories is a function of distances between
multivariate objects i and l, $t \in T$.

Let us now formalize velocity and acceleration of a trajectory.
Velocity of $Y(i)$ is defined as the rate of change of the i-th object's position in a fixed time
interval; it indicates the direction and sense of each segment of the trajectory $Y(i)$
for a given variable.

Therefore, for each time trajectory $Y(i)$, the velocity of evolution of an object i in the
interval from t to t+1, denoted $v_{ijt,t+1}$, is, for the j-th variable:

$$v_{ijt,t+1} = \ldots$$

In particular: $v_{ijt,t+1} > 0$ ($v_{ijt,t+1} < 0$) if object i, for the j-th variable, presents an
increasing (decreasing) rate of change of its position in the time interval from t to
t+1; $v_{ijt,t+1} = 0$ if object i, for the j-th variable, does not change position from t to
t+1.

Acceleration measures the variation of velocity of $Y(i)$ in a fixed time interval.

For each time trajectory $Y(i)$, the acceleration of an object i in the interval from t to
t+2, denoted $a_{ijt,t+2}$, is, for the j-th variable:

$$a_{ijt,t+2} = \ldots$$

In particular: $a_{ijt,t+2} > 0$ ($a_{ijt,t+2} < 0$) if object i, for the j-th variable, presents an
increasing (decreasing) variation of velocity in the time interval from t to t+2;
$a_{ijt,t+2} = 0$ if object i, for the j-th variable, does not change velocity from t to t+2.
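
One natural reading of these definitions is in terms of finite differences, as in the sketch below. The formulas are assumptions of this sketch and are not taken from the text: velocity is computed as the change in the variable divided by the length of the interval from t to t+1, and acceleration as the change in velocity divided by the length of the interval from t to t+2. Time points default to 1, ..., u, and the function names are hypothetical.

```python
def velocities(x, times=None):
    """Velocity over each interval [t, t+1] of one variable of one trajectory."""
    u = len(x)
    times = list(range(1, u + 1)) if times is None else times
    return [(x[t + 1] - x[t]) / (times[t + 1] - times[t]) for t in range(u - 1)]


def accelerations(x, times=None):
    """Acceleration over each interval [t, t+2]: change of velocity between
    two consecutive segments, divided by the length of the whole interval."""
    u = len(x)
    times = list(range(1, u + 1)) if times is None else times
    v = velocities(x, times)
    return [(v[t + 1] - v[t]) / (times[t + 2] - times[t]) for t in range(u - 2)]


# One variable of one object observed at u = 5 consecutive time points.
x = [1.0, 2.0, 4.0, 4.0, 3.0]
v = velocities(x)      # one value per segment: u - 1 values
a = accelerations(x)   # one value per pair of segments: u - 2 values
```

With this reading the sign of each velocity is the sign of the slope of the corresponding segment, and a positive (negative) acceleration marks a convex (concave) pair of segments, in agreement with Remark 1.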

Defined velocity and acceleration, we are now in position to evaluate differences 
between trends in a time point t, velocities and accelerations in a time interval. Let us 
first consider: 

Remark 4: (a) differences between trend intensities, in a time point /, of objects / and 
/ are evaluated according to a measure of distance between x.^ and x^^, / g J; (b) 
differences between velocities of objects / and /, in a time interval, are evaluated 
according a measure of distance between and 

/ = l,...,w-l; (c) differences between accelerations of objects / and /, in a time 
interval, are evaluated according to a measure of distance between 
~ 5 • • • j f +2 ) and ^iit+ 2 f ^"“l,.--,W-2. 



Considering remark 4 and the normalized Minkowski metric of order /?, the distances 
between trends in a time point /, velocities and accelerations in a time interval for 
object / and / are respectively: 







f . _i 




f . -I 






J 


^2 


j 




_>=1 



l/f 



(trends distance) (velocities distance) (accelerations distance) 

where p is an integer > 1 and;r,,;r 2 ,;r 3 are suitable weights to normalize distances. 



Acceleration must be computed on two time intervals, $[t, t+1]$ and $[t+1, t+2]$.

Remark 5: note that, before computing the Minkowski metric, variables have to be standardized to take into account different units of measurement and/or variability (Milligan, Cooper, 1988).

Remark 6: weights $\pi_1$, $\pi_2$ and $\pi_3$ may be defined as the range of the respective distances or their standard deviation.

Remark 7: (a) a dissimilarity between trends of $Y(i)$ and $Y(l)$ is a mapping from $\{{}_1\delta_{t}(i,l),\ t \in T\}$ to $\Re^{+}$; (b) a dissimilarity between velocities of $Y(i)$ and $Y(l)$ is a mapping from $\{{}_2\delta_{t,t+1}(i,l),\ t \in T\}$ to $\Re^{+}$; (c) a dissimilarity between accelerations of $Y(i)$ and $Y(l)$ is a mapping from $\{{}_3\delta_{t,t+2}(i,l),\ t \in T\}$ to $\Re^{+}$.

2. Dissimilarity between trajectories 

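Following remark 1, for each time trajectory characteristic (trend, velocity, acceleration), three mappings are considered: minimum, maximum and sum:

$$f_{1r}(\delta) = \min_{t}\, {}_r\delta_{t,t+r-1}(i,l), \qquad f_{2r}(\delta) = \max_{t}\, {}_r\delta_{t,t+r-1}(i,l), \qquad f_{3r}(\delta) = \sum_{t}\, {}_r\delta_{t,t+r-1}(i,l),$$

where $f_{mr}$, $m, r = 1,2,3$, is a function of the distances ${}_r\delta_{t,t+r-1}(i,l)$, $t = 1,\dots,u-r+1$. In particular, for accelerations ($r = 3$) the index $t$ runs from 1 to $u-2$.

The total dissimilarity between $Y(i)$ and $Y(l)$ according to the three mappings is:

$${}_m d^{*}(i,l) = \sum_{r=1}^{3} {}_m\gamma_r\, f_{mr}(\delta), \qquad m = 1,2,3, \qquad (2)$$

where ${}_m\gamma_r$ ($r = 1,2,3$) are suitable positive weights that indicate the contribution of each trajectory characteristic to the total dissimilarity. The weights may be chosen by assigning arbitrary values according to a subjective criterion. A more objective approach is to evaluate the weights so that the variance of (2) is maximized, as shown below.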

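Let ${}_mD^{*} = [{}_m d^{*}(i,l)]$ ($m = 1,2,3$) be the ($n \times n$) matrices of total dissimilarities between trajectories and let ${}_mD_r = [{}_m d_r(i,l)]$ ($m, r = 1,2,3$) be the ($n \times n$) matrices obtained considering the dissimilarities between $Y(i)$ and $Y(l)$ for the three ($r = 1,2,3$) features of each trajectory. Matrices ${}_mD^{*}$ and ${}_mD_r$ may be represented as vectors defined by the elements of the triangle below (above) the diagonal of the matrices, $\mathrm{vec}({}_mD^{*})$ and $\mathrm{vec}({}_mD_r)$ ($m, r = 1,2,3$); e.g., $\mathrm{vec}({}_mD^{*}) = ({}_m d^{*}(1,2), \dots, {}_m d^{*}(1,n), \dots, {}_m d^{*}(n-1,n))'$. Therefore, ${}_mD = [\mathrm{vec}({}_mD_1), \mathrm{vec}({}_mD_2), \mathrm{vec}({}_mD_3)]$ ($m = 1,2,3$) is an ($n(n-1)/2 \times 3$) matrix, while $\mathrm{vec}({}_mD^{*}) = {}_mD\,{}_m\gamma$.

In order to compute the total dissimilarity between each pair of time trajectories, $\mathrm{vec}({}_mD^{*})$, the following problem with respect to the variables ${}_m\gamma = ({}_m\gamma_1, {}_m\gamma_2, {}_m\gamma_3)'$ has to be solved:

$$\max\ \mathrm{var}(\mathrm{vec}({}_mD^{*})) = \mathrm{var}\Big(\sum_{r=1}^{3} {}_m\gamma_r\, \mathrm{vec}({}_mD_r)\Big) \quad \text{subject to } {}_m\gamma'\,{}_m\gamma = 1. \qquad (3)$$

The solution is easily obtained by differentiating the Lagrangian function of (3) and equating it to zero:

$$\frac{\partial}{\partial\, {}_m\gamma}\big[\mathrm{var}(\mathrm{vec}({}_mD^{*})) + \lambda(1 - {}_m\gamma'\,{}_m\gamma)\big] = 0, \qquad (4)$$

where $\lambda$ is the Lagrange multiplier. Since $\mathrm{var}(\mathrm{vec}({}_mD^{*})) = {}_m\gamma'\,\Sigma\,{}_m\gamma$, where $\Sigma$ is the variance and covariance matrix of ${}_mD$, equation (4) becomes $(\Sigma - \lambda I)\,{}_m\gamma = 0$. Then, premultiplying by ${}_m\gamma'$, we obtain ${}_m\gamma'\,\Sigma\,{}_m\gamma = \lambda$, so that the maximum variance corresponds to the maximum eigenvalue of $\Sigma$ and ${}_m\gamma$ is the associated eigenvector.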
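The weighting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three columns of hypothetical dissimilarities stand in for the vectorized feature matrices, the helper names are invented here, and power iteration is used as one simple way to obtain the dominant eigenpair of a symmetric positive semi-definite matrix.

```python
# Illustrative sketch: data and helper names are hypothetical.

def covariance_matrix(columns):
    """Variance-covariance matrix of a list of equally long columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [[sum((a[k] - ma) * (b[k] - mb) for k in range(n)) / (n - 1)
             for b, mb in zip(columns, means)]
            for a, ma in zip(columns, means)]

def dominant_eigenpair(s, iterations=500):
    """Power iteration: largest eigenvalue and unit-length eigenvector
    of a symmetric positive semi-definite matrix s."""
    m = len(s)
    g = [1.0 / m ** 0.5] * m
    for _ in range(iterations):
        h = [sum(s[a][b] * g[b] for b in range(m)) for a in range(m)]
        norm = sum(v * v for v in h) ** 0.5
        g = [v / norm for v in h]
    lam = sum(g[a] * s[a][b] * g[b] for a in range(m) for b in range(m))
    return lam, g

# Hypothetical vectorized dissimilarities for n = 4 objects (6 pairs):
# one column each for trend, velocity and acceleration.
d_trend = [1.0, 4.0, 2.0, 3.0, 5.0, 2.5]
d_veloc = [0.5, 2.2, 1.1, 1.4, 2.6, 1.0]
d_accel = [0.2, 0.9, 0.3, 0.8, 1.0, 0.6]
columns = (d_trend, d_veloc, d_accel)

sigma = covariance_matrix(columns)
lam, gamma = dominant_eigenpair(sigma)

# Total dissimilarity for each pair: weighted sum of the three features.
d_total = [sum(g * col[k] for g, col in zip(gamma, columns))
           for k in range(len(d_trend))]
```

By construction the sample variance of `d_total` equals `lam`, which is the maximum attainable under the unit-length constraint on the weights.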



3. Clustering Italian Regions by their employment dynamic 

For every Italian Region (Table 1), the employment levels for each census year (1951, 1961, 1971, 1981, 1991) and for different types of industries (Table 1) were considered, defining a three-way array of dimensions (19 x 18 x 5). The data were collected from the Census of Agriculture and the Census of Industry and Services (ISTAT).



Table 1: Italian Regions and Industry

Italian Regions:
Piemonte (1), Valle d’Aosta (2), Lombardia (3), Trentino (4), Veneto (5), Friuli (6), Liguria (7), Emilia Romagna (8), Toscana (9), Umbria (10), Marche (11), Lazio (12), Abruzzo-Molise (13), Campania (14), Puglia (15), Basilicata (16), Calabria (17), Sicilia (18), Sardegna (19).

Industry:
Agriculture, Mining, Manufacturing, Construction, Electric, gas, and sanitary services. Wholesale trade. Business services. Retail trade, hotel, and other lodging places. Educational, health, and legal services, Personal services. Auto, trade and services. Transportation, Communication, Finance, Insurance, Real estate. Social services.



Firstly, the data were analyzed using Factorial Dynamic Analysis (FDA) (Coppi, Zannella, 1978). This strategy of analysis gives information about the structure and the evolution of employment. The temporal evolution, characterizing the employment level of each Italian Region, was synthesized by the values of the first two factors of FDA, explaining 86% of the total variance. The first factor is highly correlated with the employment levels of Industry and Tertiary. The second is more correlated with employment in Agriculture, Mining, Construction and Social Service. The time trajectories can be drawn in the factorial plane defined by the first two factors. The dissimilarities between trajectories were computed with $p = 2$ in the Minkowski metric, and the matrices ${}_3D_1$, ${}_3D_2$, ${}_3D_3$ regarding trend, velocity and acceleration are reported in Table 2. To normalize dissimilarities, $\pi_1 = \max({}_3d_1(i,l))$, $\pi_2$, $\pi_3$ were used as indicated in remark 6.



Table 2: Dissimilarity matrices ${}_3D_1$, ${}_3D_2$, ${}_3D_3$

        1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19
  1       33.96 41.81 27.20  3.33 23.95 18.95  2.19  7.15 29.37 24.73  6.67 25.55  9.96 16.64 31.96 25.46 12.82 24.30
  2             75.46  6.82 32.65 10.04 15.07 33.01 30.42  4.77  9.62 40.40  9.19 27.71 20.41  2.33  9.60 27.92 12.14
  3                   68.81 43.93 65.57 60.51 43.21 47.12 71.01 66.46 35.90 67.33 50.13 57.91 73.59 67.26 51.68 65.98
  4                         25.87  3.34  8.35 26.23 23.65  2.45  3.04 33.61  2.90 20.96 13.88  4.81  3.66 21.30  6.20
  5                               22.64 17.86  1.91  4.17 27.95 23.17  8.95 23.88  7.56 14.27 30.56 23.75  9.80 22.13
  6                                      5.28 22.97 20.51  5.44  1.54 30.38  2.92 17.93 11.20  8.04  3.67 18.45  5.38
  7                                           18.15 15.94 10.60  6.33 25.39  7.38 13.43  8.02 13.12  7.49 14.39  7.95
  8                                                  5.64 28.33 23.58  7.79 24.33  8.33 14.95 30.94 24.21 11.29 22.78
  9                                                       25.67 20.83 12.18 21.42  4.13 11.10 28.25 21.21  5.80 19.30
 10                                                              4.85 35.77  4.49 23.00 15.71  2.65  5.01 23.21  7.62
 11                                                                   31.08  1.53 18.24 11.03  7.44  2.42 18.49  4.41
 12                                                                         31.82 14.58 22.16 38.35 31.65 17.05 30.32
 13                                                                               18.68 11.26  6.92  1.26 18.77  3.36
 14                                                                                      8.00 25.49 18.35  3.04 16.55
 15                                                                                           18.12 10.83  7.59  8.74
 16                                                                                                  7.30 25.64  9.83
 17                                                                                                       18.36  2.72
 18                                                                                                             16.17
 19

        1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19
  1        7.72 10.81  6.34  1.68  5.88  4.99  0.95  1.56  6.60  5.30  2.71  5.77  2.58  3.17  7.18  5.82  2.43  5.56
  2             18.40  1.46  8.37  2.18  3.00  8.45  7.12  1.35  2.63  9.89  2.16  5.61  4.96  0.56  1.99  5.99  2.77
  3                   16.96 10.75 16.62 15.73 10.09 11.70 17.32 16.04  8.94 16.47 13.11 13.75 17.87 16.56 12.94 16.28
  4                          7.09  1.22  1.89  7.04  5.78  0.82  1.50  8.49  1.03  4.27  3.62  0.98  0.88  4.71  1.95
  5                                6.30  6.02  1.21  2.06  7.07  5.77  3.15  6.23  3.84  4.19  7.84  6.62  3.26  5.80
  6                                      1.64  6.58  5.12  0.89  0.60  8.28  0.48  3.89  3.31  1.66  0.83  4.11  1.08
  7                                            5.83  4.63  2.05  1.37  7.20  1.68

[The remaining entries of the second matrix and the third matrix are missing from the source.]

To obtain the weights ${}_3\gamma$ needed to determine the total dissimilarity, the largest eigenvalue and the associated eigenvector of $\Sigma$ were computed: $\lambda = 2.9412$ and ${}_3\gamma' = (0.5802,\ 0.5783,\ 0.5735)$. The total dissimilarity matrix computed through (2) is shown in Table 3.

Table 3: Total dissimilarity matrix ${}_3D^{*}$




Finally, the method of average linkage was applied to the matrix ${}_3D^{*}$. The dendrogram is shown in Figure 2, with three classes of similar trajectories according to the dissimilarities in Table 3. The number of classes was fixed according to the test proposed by Calinski and Harabasz (1974).
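For readers unfamiliar with average linkage, the following is a minimal Python sketch of the agglomerative procedure on a small hypothetical dissimilarity matrix (not the data of Table 3). The function name is an assumption, and the number of clusters is passed in directly rather than selected by the Calinski and Harabasz test.

```python
# Illustrative sketch: hypothetical data, number of clusters given by hand.

def average_linkage(dist, n_clusters):
    """Agglomerative clustering with average linkage.
    dist is a symmetric matrix (list of lists); returns the clusters as
    lists of object indices."""
    clusters = [[i] for i in range(len(dist))]

    def between(a, b):
        # Average of all pairwise dissimilarities between members of a and b.
        return sum(dist[i][l] for i in a for l in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        pairs = [(between(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical 5 x 5 dissimilarity matrix.
d = [[0, 1, 9, 8, 20],
     [1, 0, 8, 9, 21],
     [9, 8, 0, 2, 19],
     [8, 9, 2, 0, 18],
     [20, 21, 19, 18, 0]]
groups = average_linkage(d, 3)
```

Here objects 0 and 1 merge first, then objects 2 and 3, leaving object 4 as a cluster of its own.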

Figure 2: Dendrogram of the time trajectories regarding the employment evolution of the Italian Regions

Note: Piemonte (1), Valle d’Aosta (2), Lombardia (3), Trentino (4), Veneto (5), Friuli (6), Liguria (7), Emilia Romagna (8), Toscana (9), Umbria (10), Marche (11), Lazio (12), Abruzzo-Molise (13), Campania (14), Puglia (15), Basilicata (16), Calabria (17), Sicilia (18), Sardegna (19).

4. Discussion

In this paper new dissimilarities between time trajectories $Y(i)$ and $Y(l)$ have been discussed. These are defined as a combination of distances between trends, velocities and accelerations of $Y(i)$ and $Y(l)$, features strongly characterizing trajectories.

The concepts of velocity and acceleration were already introduced by Kosmelj (1984) in the study of the evolution of phenomena, using these characteristics to measure the dissimilarities between objects. Here, velocity and acceleration are considered features of each single trajectory (object).

The dissimilarity between trajectories proposed by Carlier (1986) can be considered a particular case of the one proposed here, in which acceleration is zero and velocity regards constant time intervals.

Abstract: A brief overview is presented of techniques for the analysis of
three-way data. Techniques will be distinguished in terms of the type of data 
they are meant to analyze, on the basis of the type of analysis pursued, and 
the three-way models are distinguished with respect to their uniqueness. 
Next, recent approaches are discussed for eliminating nonuniqueness by 
imposing constraints and rotation to simple structure. 

Keywords: three-way data, three-way models, multi-way data. 



1. Introduction 

Three-way data are data associated with three sets of units. As examples of 
three-way data one may, in behavioral research, have scores of a set of 
individuals on a set of variables at different occasions, or, with fluorescence 
spectroscopy, emission intensity of light of various excitation wavelengths by 
various chemical mixtures at various emission wavelengths. In these 
examples the three sets of units defining the three "ways" of the data are 
different. However, one also commonly encounters three-way data where 
two sets of units are equal, for instance in the case of data consisting of the 
similarities between a number of stimuli as judged by a number of judges, 
or, data consisting of several covariance matrices for a set of variables. 
Carroll and Arabie (1980) denoted three-way data pertaining to only two
different sets of units (called "modes") as two-mode three-way data, and 
those with three different modes as three-mode three-way data. The 
difference in data type leads to a first difference in techniques for analyzing 
three-way data. 

Techniques for the analysis of three-way data have their origin in techniques 
for the multivariate analysis of two-way data, notably principal components 
analysis (PCA) and multidimensional scaling (MDS). They aim at 
representations for the units of all modes in the data. A number of three-way
techniques are based on considering three-way data as two-way data by
combining two ways into one. For instance, one may consider J variables
measured at K different occasions as a set of JK variables, and analyze the
resulting I by JK data set (with I denoting the number of observation units)
by PCA or any other two-way analysis technique, thus obtaining a
representation for the individuals, and for the combinations of variables and
occasions. By combining this analysis with analyses of other two-way data 
sets constructed from the three-way data, separate representations for the 
variables and for the occasions are obtained as well. Such techniques are 
called combined two-way techniques here. On the other hand, techniques 
have been proposed that are based on models specifically designed to 
represent all modes in the three-way data, and which are consequently fitted 
to the data. These are called three-way model fitting techniques. In Section 2 
an overview will be given of both types of techniques. 
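The combination of two ways into one can be made concrete with a short Python sketch. It is only an illustration: the function name, the nested-list representation of the I x J x K array, and the column order (variables within occasions) are assumptions.

```python
# Illustrative sketch: representation and column order are assumptions.

def unfold(x):
    """x[i][j][k] holds the score of object i on variable j at occasion k.
    Returns the I x JK matrix in which the J variables are repeated for
    each of the K occasions."""
    n_vars = len(x[0])
    n_occasions = len(x[0][0])
    return [[x[i][j][k] for k in range(n_occasions) for j in range(n_vars)]
            for i in range(len(x))]

# I = 2 objects, J = 3 variables, K = 2 occasions.
x = [[[1, 10], [2, 20], [3, 30]],
     [[4, 40], [5, 50], [6, 60]]]
wide = unfold(x)
```

Each row of `wide` now contains the JK scores of one object, so any two-way technique such as PCA can be applied to it.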



2. Analysis of Three-way Data 

In the present section a brief overview will be given of techniques for the 
analysis of three-way data. More detailed and comparative information on 
the techniques can be found, for instance, in the books by Law, Snyder, 
Hattie and McDonald (1984) and Coppi and Bolasco (1989), a special issue 
of Computational Statistics and Data Analysis (1994, Volume 18, number 1), 
a paper by Kiers (1991) and a book chapter by Rizzi and Vichi (1995). 
Below, first the combined two-way techniques will be described, then the 
three-way model fitting techniques. In the following, we assume for 
simplicity that the three-mode three-way data pertain to scores of I objects 
on J variables at K occasions, and that two-mode three-way data consist of a 
set of I by I (dis)similarity matrices obtained from K different judgements or
occasions. 

Among the first combined two-way techniques for the analysis of three-way 
data is the points-of-view analysis proposed by Tucker and Messick (1963). 
This technique for the analysis of an (IxIxK) two-mode three-way data set 
consists of collecting the ½I(I-1) dissimilarities of each set in a vector, and
analyzing the ½I(I-1)xK matrix consisting of these vectors by means of
PCA. The desired number of principal components is rotated to simple 
structure, and the scores on the resulting components are reordered into 
dissimilarity matrices again. Finally, each such summarizing dissimilarity 
matrix is analyzed by (two-way) MDS. Thus, by summarizing the K 
judgements (first step), and the I objects (second step), both modes of the 
three-way array have been taken into account. 

A similar method, but meant for both three-mode and two-mode three-way 
data, is STATIS (L’Hermier des Plantes, 1976, also see Carlier et al., 
1989). The method is based on first analyzing the "correlations" between a
set of data matrices by means of PCA, thus describing the "interstructure"
for the K units in the third mode. Secondly, the data or cross-products
matrix associated with the first principal component, called "compromise", is
analyzed by PCA, now yielding a representation of the I objects and, if
present, the J variables. As with points-of-view analysis, there is no need to
restrict oneself to using only one compromise matrix: D’Alessio (1989)
proposed to compute and analyze all compromises that play an important 
role in the interstructure analysis. This idea has also been pursued in the 
FAMA approach (see Rizzi & Vichi, 1995), which, furthermore, allows one to
replace the ordinary matrix correlation coefficients (used in the first step) by 
other correlation measures. A similar approach has been proposed by Vichi 
(1997) for hierarchical cluster analysis of two-mode three-way data. 

Many three-way model fitting techniques have been proposed. We focus on 
exploratory models, thus ignoring models involving external information. 

First we describe models for the analysis of three-mode three-way data. One 
of the first such models is Tucker’s (1966) three-mode factor analysis model 
(3MFA). In 3MFA, components are found for each mode, and these are 
linked by a three-way array called "core". The data are modeled by 

$$x_{ijk} = \sum_{p=1}^{P}\sum_{q=1}^{Q}\sum_{r=1}^{R} a_{ip}\, b_{jq}\, c_{kr}\, g_{pqr}, \qquad (1)$$

$i=1,\dots,I$, $j=1,\dots,J$, $k=1,\dots,K$, where $a_{ip}$, $b_{jq}$, $c_{kr}$ are elements of the
component matrices A (IxP), B (JxQ), and C (KxR), and the additional
parameters $g_{pqr}$ denote the elements of the PxQxR so-called "core array",
which indicate how the components from the different modes interact.
Another early model is the PARAFAC model proposed by Harshman (1970,
see also Harshman & Lundy, 1984). Like PCA, this model uses only one set
of components for all three modes, but it differs from PCA in that the
relative contributions of the different components are allowed to differ over
occasions. The name PARAllel FACtor analysis refers to the components
being proportional over occasions. In PARAFAC the data are modeled by

$$x_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}, \qquad (2)$$

$i=1,\dots,I$, $j=1,\dots,J$, $k=1,\dots,K$, where $a_{ir}$, $b_{jr}$, $c_{kr}$ again denote elements of
matrices A, B, and C. The relation of PARAFAC to ordinary PCA may be
more apparent from the matrix description of the model, which, for each
frontal plane of the array, denoted as $X_k$, $k=1,\dots,K$, reads:

$$X_k = A D_k B', \qquad (3)$$

where $D_k$ is a diagonal matrix with the elements of the $k$-th row of C. The
PARAFAC model is equal to Carroll and Chang’s (1970) CANDECOMP
model. They also show (p. 312) that it can be considered as a version of
3MFA in which the core is constrained to be "superdiagonal" (which implies
that $g_{pqr}$ is unconstrained if p=q=r and $g_{pqr}$ is constrained to 0 otherwise).
The 3MFA model cannot conveniently be described in terms of matrix
products. This is no problem for the interesting special case of 3MFA where
no reduction of the third mode is attempted (hence with C=I), dubbed the
Tucker2 model by Kroonenberg and De Leeuw (1980):

$$X_k = A G_k B', \qquad k=1,\dots,K. \qquad (4)$$

Here the matrices $G_k$ express the relations between components for mode A
and mode B, as they differ over occasions. Recently, Harshman and Lundy
(1994) proposed a constrained version of this model by the name
PARATUCK2, with features of both the PARAFAC and the Tucker2 model:

$$X_k = A D_k^{A} G D_k^{B} B', \qquad (5)$$

where the single matrix G models the relations between components from
the two modes, but, as in PARAFAC, differential component weights (in the
diagonal matrices $D_k^{A}$ and $D_k^{B}$) are employed as well.
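The relation between 3MFA and PARAFAC described above can be checked numerically. The Python sketch below is an illustration only: the function names and the small integer matrices are hypothetical, and the superdiagonal core is given unit entries, which assumes that any scaling has been absorbed into the component matrices.

```python
# Illustrative sketch: names and data are hypothetical.

def tucker3(a, b, c, g):
    """x_ijk = sum over p, q, r of a_ip * b_jq * c_kr * g_pqr."""
    n_p, n_q, n_r = len(g), len(g[0]), len(g[0][0])
    return [[[sum(a[i][p] * b[j][q] * c[k][r] * g[p][q][r]
                  for p in range(n_p) for q in range(n_q) for r in range(n_r))
              for k in range(len(c))]
             for j in range(len(b))]
            for i in range(len(a))]

def parafac(a, b, c):
    """x_ijk = sum over r of a_ir * b_jr * c_kr."""
    n_r = len(a[0])
    return [[[sum(a[i][r] * b[j][r] * c[k][r] for r in range(n_r))
              for k in range(len(c))]
             for j in range(len(b))]
            for i in range(len(a))]

def superdiagonal_core(size):
    """Core array with ones where p = q = r and zeros elsewhere."""
    return [[[1 if p == q == r else 0 for r in range(size)]
             for q in range(size)]
            for p in range(size)]

a = [[1, 2], [3, 4]]          # I = 2 objects, 2 components
b = [[1, 0], [0, 1], [1, 1]]  # J = 3 variables, 2 components
c = [[1, 1], [2, 0]]          # K = 2 occasions, 2 components

x_tucker = tucker3(a, b, c, superdiagonal_core(2))
x_parafac = parafac(a, b, c)
```

With a superdiagonal core only the terms with p = q = r survive in the triple sum, so both functions return the same array.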

The final three-mode model mentioned here is meant for three-mode data 
consisting of (dis)similarities between row- and column-objects, observed at 
K instances. Such data can be described by "unfolding models", positioning
the objects in a Euclidean space such that their distance reflects the
(dis)similarity between the objects. Thus, in DeSarbo and Carroll’s (1985)
three-way metric unfolding model, the dissimilarity $\delta_{ijk}$ is modeled as

$$\delta_{ijk} = \sum_{r=1}^{R} c_{kr}\,(a_{ir} - b_{jr})^{2} + \gamma_k, \qquad (6)$$

where A and B contain the coordinates for the row- and column-objects, C
contains dimension weights, and $\gamma_k$ is an additive constant.

The best-known model for two-mode three-way data is probably the 
INDSCAL model (Carroll & Chang, 1970), which in fact is equivalent to 
the three-way unfolding model, (6), with A=B; actually, the three-way 
unfolding model was proposed as a generalization of the INDSCAL model. 
The "scalar products" version of the INDSCAL model reveals its relation to 
PARAFAC; the $k$-th transformed (dis)similarity matrix $S_k$ is modeled as:

$$S_k = A D_k A'. \qquad (7)$$

Likewise, symmetric variants of 3MFA (Three-mode scaling), of the
Tucker2 model (IDIOSCAL), and of PARATUCK2 (PARAFAC2) have been
proposed by Tucker (1972), Carroll and Chang (1972), and Harshman
(1972), respectively:

$$S_k = A\Big(\sum_{r=1}^{R} c_{kr}\, G_r\Big)A', \qquad S_k = A G_k A', \qquad S_k = A D_k G D_k A', \qquad k=1,\dots,K. \qquad (8)$$

In all these models, the intermediate matrices $G_r$, $G_k$ and $G$ are constrained to be
symmetric, or even positive semi-definite. The INDSCAL model itself has
been extended by "specificity" terms for the objects, thus incorporating one
unique dimension for each object (see Winsberg & Carroll, 1989).

Finally, some models have been proposed for sets of asymmetric two-mode 
three-way data, such as sets of (asymmetric) friendship data, or brand 
switching data. Harshman, Green, Wind and Lundy (1982) proposed the 
three-way DEDICOM model simply as the PARAFAC2 model, now with G 
asymmetric. A more specialized model for the analysis of such data has been 
proposed by Rocci and Bove (1994). Here the objects are represented as
vectors in a Euclidean space, and their asymmetric relation can be read from 
the plot, while differences between occasions are modeled as different 
weights attached to the lengths of these vectors. 



3. Methods for Fitting Three-way Models to Three-way Data 

The above models are most often fitted to the data by minimizing the sum of 
squared residuals. Least squares fitting algorithms have been proposed by 
the authors of the models, except for 3MFA (algorithm by Kroonenberg & 
De Leeuw, 1980) and PARAFAC2 (algorithm by Kiers, 1993). 

Recently, much attention has been paid to the computational efficiency of 
some least squares fitting procedures. First of all, procedures have been 
proposed for efficiently handling large data arrays. These are based on 
compressing the data array by finding good projections onto subspaces for 
these modes (so as to incur little or no loss of information), and fitting the 
model to the compressed array; the solution to the compressed array is next 
expanded so as to obtain estimates for the full data set (see Kiers &
Harshman, 1997 for an overview, and Bro & Andersson, 1997, and Kiers, 
1998 for the most recent results). Furthermore, procedures for avoiding slow 
convergence (notably of PARAFAC) have been proposed recently (e.g., 
Paatero, 1997, Bro, 1998). When one of the modes is considered as random, 
the possibility arises to formulate factor analytic models for a super- 
covariance matrix, to be fitted by maximum likelihood procedures. Bloxom 
(1968) thus proposed a factor analytic version of the 3MFA model, and 
procedures for maximum likelihood fitting of the 3MFA model have been 
proposed by Bentler and Lee (1978). Mayekawa (1987) likewise proposed a
maximum likelihood procedure for fitting the PARAFAC model. Also, a 
maximum likelihood procedure is available for fitting the INDSCAL model 
(as part of Ramsay’s Multiscale program).
Finally, in chemometrics, closed-form procedures have been proposed that
are very efficient, fit noise-free data perfectly, and supposedly work well in 
cases of low amounts of noise (see Bro, 1998). 

In addition to these fitting procedures, variants of the least squares fitting 
procedures have been proposed that employ special metrics (for examples, 
see Coppi & Bolasco, 1989), or optimal scaling to handle nonnumerical data 
(for three-mode three-way data, see Sands & Young, 1980; for two-mode 
three-way data, see Takane, Young & De Leeuw, 1977, proposing 
ALSCAL, a nonmetric variant of INDSCAL). 



4. Indeterminacy and Constraints 

The 3MFA and Tucker2 model are indeterminate: equally well fitting 
solutions can be obtained after arbitrary nonsingular transformations of all 
component matrices, provided that these are compensated in the core. On the 
other hand, their constrained versions PARAFAC and PARATUCK2 do 
(usually) give unique solutions (e.g., see Harshman & Lundy, 1994).
PARAFAC, however, is often too restrictive for describing one’s data; 
PARATUCK2 is an interesting compromise, being less restrictive than 
PARAFAC, but nevertheless unique. Analogously, for two-mode models, 
the IDIOSCAL model is not unique, whereas its constrained versions
INDSCAL and PARAFAC2 (usually) are. Recently, it has been suggested to
employ other constrained variants of 3MFA, notably with arbitrary sets of 
core elements constrained to zero (Kiers, 1992, Rocci, 1992), which, in 
certain cases leads to unique models as well (Kiers, Ten Berge & Rocci, 
1997), or, in combination with some constraints on the component matrices 
may lead to partial uniqueness (Kiers & Smilde, 1998). Moreover, it has 
been shown that large numbers of such zero core constraints can be imposed 
without loss of fit (Murakami, Ten Berge & Kiers, 1998; Ten Berge & 
Kiers, 1997). 
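The indeterminacy of the Tucker2 model can be verified on a single frontal slice. The Python sketch below is an illustration with hypothetical matrices and helper names: the component matrix A is postmultiplied by a nonsingular matrix T, the inverse of T is absorbed into the core slice, and the fitted values are unchanged.

```python
# Illustrative sketch: matrices and helper names are hypothetical.

def matmul(x, y):
    """Product of two matrices given as lists of rows."""
    return [[sum(x[i][m] * y[m][j] for m in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]

def transpose(x):
    return [list(row) for row in zip(*x)]

def inverse_2x2(t):
    det = t[0][0] * t[1][1] - t[0][1] * t[1][0]
    return [[t[1][1] / det, -t[0][1] / det],
            [-t[1][0] / det, t[0][0] / det]]

a = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]  # 3 objects, 2 components
b = [[1.0, 0.0], [2.0, 1.0]]              # 2 variables, 2 components
g_k = [[2.0, 1.0], [0.0, 3.0]]            # core slice for one occasion
t = [[2.0, 1.0], [1.0, 1.0]]              # arbitrary nonsingular transformation

# Fitted slice: A G_k B'
fitted = matmul(matmul(a, g_k), transpose(b))

# Transform A and compensate in the core slice.
a_star = matmul(a, t)
g_star = matmul(inverse_2x2(t), g_k)
fitted_star = matmul(matmul(a_star, g_star), transpose(b))
```

The component matrix has changed but the fitted slice has not, which is why additional constraints or a simple structure rotation are needed to single out one solution.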

Alternatively, constraints can be used to incorporate external information on 
the data, and thus, hopefully find better estimates of the parameters 
underlying the model (for an overview, see Bro, 1998). For instance, when 
the underlying entities are known to be nonnegative, it is useful to constrain 
the parameters to nonnegativity. Also, in case one of the modes refers to 
consecutive points in time, one may expect a trend in the data, and model 
this, or only its smoothness. For descriptive purposes it can also be desirable 
to impose the above or other constraints, like equality of components, 
nonnegative correlations of components with variables, or simple structure 
on the components (e.g., see Krijnen, 1993; Kroonenberg & Heiser, 1998). 
An extreme case of simplicity is where the component matrices are 
constrained to be binary. This leads to overlapping cluster models (Carroll & 
Chaturvedi, 1995), the best-known of which is the INDCLUS model.



5. Solving the 3MFA Indeterminacy by Simple Structure Rotation

In PCA, a common procedure for solving the indeterminacy of the solution
is to rotate the loading matrix to "simple structure" (with many elements 
close to zero). A similar approach can be followed for 3MFA, and would 
have as an advantage over constraining the solution that by rotation no loss 
of fit is incurred. However, in 3MFA it is not so clear what aspect of the 
solution should be simplified. Tucker (1966) mentioned simplifying either 
the component matrices (by varimax rotation) or the core. As discussed by 
Kroonenberg (1983, Chapter 5), the first core rotation procedures aimed at 
diagonalizing the frontal core planes. Later, Kiers (1992) proposed to rotate 
the core to superdiagonality. The first procedure for rotation of the core,
over all three modes, to an arbitrary simple structure was the oblique
"tri-quartimax" rotation procedure by Kruskal (1988). This paper served as a
starting point for the development of Kiers’ (1997b) "three-mode orthomax" 
core rotation procedure, where "orthomax" represents a family of orthogonal 
simple structure rotation methods. Recently, Kiers (1997a) proposed an 
extension of the three-mode orthomax procedure in which the core and the 
component matrices are rotated to simplicity jointly. By varying the 
importance attached to simplicity of the core and each of the component 
matrices, a very flexible approach arises, of which the core rotation and 
optimal simplicity rotation of the component matrices are special cases. 
Except for Kruskal’s procedure, the above techniques pertain to orthogonal 
rotations. Kiers (1997b) mentioned how oblique rotation can be obtained 
from three-mode orthomax. A procedure aimed explicitly at oblique simple 
structure is Kiers’ (in press) three-mode generalization of SIMPLIMAX for 
oblique rotation of the core. So far, no procedures have been developed for 
joint oblique rotation of the core and component matrices. 



6. Discussion 

A brief overview has been given of techniques for the analysis of three-way 
data, with special attention to recent developments. By concentrating on 
techniques that provide representations for all modes of the data, many 
related techniques are ignored (e.g., all techniques meant for the joint 
analysis of a number of data sets involving the same variables but different 
observation units, or vice versa), whereas such techniques may very well be 
applied successfully to three-way data. For further reading and for keeping 
up to date with respect to three-way analysis, I refer to two internet home 
pages: Rasmus Bro’s home page (http://newton.foodsci.kvl.dk/rasmus.html) 
with emphasis on chemometrics, and Pieter Kroonenberg’s home page 
(http://www.fsw.leidenuniv.nl/~kroonenb/) of general three-way interest.

1. Introduction

In his 1983 annotated bibliography of three-mode analysis, Kroonenberg (1983a) 
claimed that all theoretical publications pertaining to three-mode component and 
factor analysis were included, as well as all published applications of these 
methods. The claim might be a bit overstated, but in this paper the key problem 
behind the paper is taken up again: “How does one go about collecting such a set 
of references, especially the applications?” Keyword searches in titles, keyword 
and abstract sections, as listed in large data banks such as that of the Institute for
Scientific Information, will generally come up with all relevant technical papers, 
but not necessarily with all applications of the techniques, because authors often 
use statistical techniques without mentioning this in the abstract. However, 
techniques are seldom used without a reference to one or more key publications 
providing background to the techniques. The only exceptions to this are references
to techniques which are so much part of the lore of science that researchers
familiarise themselves with the techniques via general textbooks or refer to the
technique without citing anyone; the t-test is an example. The techniques for
three-mode analysis are mentioned in several general textbooks, but virtually 
nowhere in sufficient detail to be useful to a practitioner. 

In this paper we will describe a first attempt to map the introduction and 
development of three-mode analysis in (analytical) chemistry starting with a set of 
key publications from within the discipline and a set of earlier key publications 
from the field of psychometrics where three-mode analysis originally was 
developed. Primarily the methodological opening moves will be reported here, 
rather than the real substantive results mapping three-mode analysis in chemistry 
itself, nor will the diffusion and synthesis of techniques developed within 
chemometrics and psychometrics be part of the discussion. Partly such 
information can already be found in Bro, Workman, Mobley and Kowalski (1997),
but in later publications we will go into these issues in greater depth starting from
a bibliometric perspective.

All literature searches reported in this paper were carried out via the Internet 
facility, the Web of Science® provided by the Institute for Scientific Information 
(http://www.isihost.com) on a trial basis to the universities of The Netherlands.
The trial version unfortunately only contains papers from 1987 on, so that the 
earliest history could not be traced in this way and will thus not be considered. 
The reference date for the search carried out in the Science Citation Index 
Expanded® is (approximately) 31 December 1997. 



2. Data Base 

2.1 Key Publications 

The basic idea of investigating the introduction of a new technique, such as three- 
mode analysis, via key publications is that most authors develop a technique 
building upon the work of others or apply the technique explained in certain key 
publications. Three-mode analysis is here defined as the set of techniques for the 
analysis of data which can be arranged in a block or a set of matrices, such as 
chemical compounds which are dissolved in different strength solutions on which 
certain characteristics are measured. In three-mode analysis one may build upon 
earlier results obtained in two-mode analysis or upon earlier results in three-mode 
analysis itself. For the present investigation both these sources are important. 
Certain chemical problems were such that chemometricians initially invented 
three-mode analysis for themselves only to discover that psychometricians had 
been there before them. Notwithstanding, at present, chemistry is one of the most 
vigorous areas of new developments in the field, often in close co-operation with 
psychometricians, as is evident from joint conferences (TRIC, see Geladi, 1993;
TRICAP, see Bro & Kiers, 1997) and publications. In order to discover all or most
of the chemical publications in three-mode analysis, one has to recognise the two 
sources mentioned. In addition, the key publications should be sufficiently general 
that one may expect that a paper seriously discussing or applying the techniques 
should quote at least one of them. On the other hand, the set should be small 
enough to be manageable, and the publications should not be cited for substantive 
reasons. It is of course impossible to avoid this completely, so that a certain 
amount of manual checking is unavoidable. Finally, for some authors we have
designated more than one paper as a key publication, even if not all of them are
highly cited, because citing authors often seemed to choose between them, and in
our eyes not necessarily the proper one. When tracing the dissemination of
ideas, authors’ awareness of certain work was considered more important than the 
exact citation count. 

On the basis of exposure to the literature, the two sets of key publications listed
in Table 1 were chosen for the present project. In the list of psychometric
papers, no papers by Kiers (University of Groningen) are included,
notwithstanding his wealth of papers in this area. This is primarily because he has
written very few overview papers to which people naturally refer, and authors
citing him generally also cite at least one of the key publications listed in Table 1.
Thus even though his work is vital for the further development of the field, in
general his publications seem to be too technical to be used as standard references.

Conspicuously absent from the set of chemistry key publications are the two
papers based on the unpublished doctoral thesis by Spanjer, i.e. De Ligny, Spanjer,
Van Houwelingen and Weesie (1983) and Spanjer, De Ligny, Van Houwelingen
and Weesie (1984). These two papers are among the first applications of three-
mode analysis to chemical data and they also include a new algorithm for the
three-mode principal component model, which has never been published
elsewhere. Unfortunately, these papers have not caught the attention of the
chemistry community, although they are now referred to in recent survey papers.
Three other papers have the potential to become key publications (Henrion, 1994; 
Bro, 1997; Bro, Workman, Mobley, and Kowalski, 1997), but they are too recent 
to have collected citations by papers which do not cite one of the key publications 
already present. At the same time, it is expected that the papers by Tucker will 
gradually fade into the background, especially his 1963 and 1964 ones, because 
for newcomers to the field, later papers will effectively act as a screen for his 
seminal work. Some of this is already under way, and one especially expects to 
see this in applied papers. 

To obtain confirmation that the chemical papers mentioned are indeed key
publications in this area, their citation counts were compared with those of the 120 entries in
the complete list of publications retrieved from all key publications listed in Table
1. On this list there were several papers with higher citation counts than the six
key publications. However, two were review papers discussing both three-mode
and other techniques and they are thus not eligible. Another two papers, from the
group around MacGregor and Nomikos at McMaster University, Canada, were
quoted especially for their applications. High citation counts were also found for
papers by the prolific group around Roma Tauler at the University of Barcelona,
but inspection showed that ½ to ¾ of these were self-citations, so that they could not
be considered key publications in the area. Within the psychometric field, no
separate check was carried out for the key publications, as they are
unquestionably at the core of the field. One of the most cited papers in three-mode
analysis is Carroll and Chang (1970), who present, in addition to their main
model, a model (and algorithm) for three-mode analysis which is essentially the
same as that proposed by Harshman, whose work was only formally and fully
published in 1984. Carroll and Chang's paper was, however, not included in the
present list, because all papers referring to Carroll and Chang's work also referred
to one of the other psychometric publications, but not vice versa. If the search
were to go further back in time, their paper should be included because of its
early publication date. For instance, Appellof and Davidson (1981, 1983) referred
to the paper, showing that they were aware of at least some of the psychometric
work in three-mode analysis.
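
The exclusion rule used for Carroll and Chang (1970) amounts to a subset check: a candidate adds nothing to the search if every paper citing it also cites at least one publication already in the key set. The sketch below illustrates this check only; the function name and the paper identifiers are hypothetical and the data are invented, not taken from the ISI data base.

```python
def is_redundant(candidate_citers, key_citers_by_publication):
    """Return True if every paper citing the candidate also cites a key publication."""
    # Union of all papers that cite at least one key publication.
    covered = set().union(*key_citers_by_publication.values())
    return set(candidate_citers) <= covered


# Hypothetical identifiers of citing papers (illustration only).
key_citers = {
    "Tucker 1966": {"p1", "p2", "p5"},
    "Harshman 1970": {"p2", "p3", "p4"},
}
carroll_chang_citers = {"p2", "p3"}

redundant = is_redundant(carroll_chang_citers, key_citers)
```

If the search window were extended to earlier years, the same check could be rerun to see whether the candidate starts to contribute citing papers of its own.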

2.2 Completeness of the Referring Papers

Unfortunately, it cannot be claimed that the list of citing papers is complete. To
do so would be to ignore the unreliability of the ISI data base. First of all, the
coverage of the ISI is not complete, but its claim that it covers the world's most
important science journals is probably true, even though there is a strong English
language bias in its coverage. There exists for instance a wealth of Japanese
scientific journals which are not included. On the other hand, the impact of such
journals is naturally limited due to their Japanese language orientation.

A more serious problem, however, is the inaccuracy due to author errors in
citation. For instance, all key publications suffer from inaccurate citations, such
as incorrect first page numbers or incorrect years. Unfortunately, the problem in
chemistry might be more serious than for instance in the social sciences, because


Table 1: Chemometric and Psychometric Key Publications in Three-Mode Analysis

| Author(s) | Year | Title | Source |
|---|---|---|---|
| Chemometric Key Publications | | | |
| Appellof, C.J. & Davidson, E.R. | 1981 | Strategies for analyzing data from video fluorometric monitoring of liquid chromatographic effluents. | Analytical Chemistry |
| Appellof, C.J. & Davidson, E.R. | 1983 | Three-dimensional rank annihilation for multi-component determinations | Analytica Chimica Acta |
| Geladi, P. | 1989 | Analysis of multi-way (multi-mode) data. | Chemometrics and Intelligent Laboratory Systems |
| Sanchez, E. & Kowalski, B.R. | 1990 | Tensorial resolution: A direct trilinear decomposition | Journal of Chemometrics |
| Smilde, A.K. & Doornbos, D.A. | 1991 | 3-way methods for the calibration of chromatographic systems - comparing PARAFAC and 3-way PLS. | Journal of Chemometrics |
| Smilde, A.K. | 1992 | Three-way analyses: Problems and prospects. | Chemometrics and Intelligent Laboratory Systems |
| Psychometric Key Publications | | | |
| Tucker, L.R. | 1963 | Implications of factor analysis of three-way matrices for measurement of change. | In Harris, C.W. (Ed.) |
| Tucker, L.R. | 1964 | The extension of factor analysis to three-dimensional matrices. | In Gullikson, H. & Frederiksen, N. (Eds.) |
| Tucker, L.R. | 1966 | Some mathematical notes on three-mode factor analysis. | Psychometrika |
| Harshman, R.A. | 1970 | Foundations of the Parafac procedure: Models and conditions for an "explanatory" multi-mode factor analysis. | UCLA Working Papers in Phonetics |
| Kroonenberg, P.M. & De Leeuw, J. | 1980 | Principal component analysis of three-mode data by means of alternating least squares algorithms. | Psychometrika |
| Kroonenberg, P.M. | 1983b | Three-Mode Principal Component Analysis: Theory and Applications. | DSWO Press: Leiden. |
| Harshman, R.A. & Lundy, M.E. | 1984a | The PARAFAC model for three-way factor analysis and multidimensional scaling. | In Law et al. (Eds.) |
| Harshman, R.A. & Lundy, M.E. | 1984b | Data preprocessing and the extended PARAFAC model. | In Law et al. (Eds.) |

of the incomplete referencing system in chemistry journals. Often, titles of articles
are not mentioned, and authors are required to use abbreviations for journal titles,
which is asking for errors to be made. For instance, if a paper appearing in the
Journal of Chemometrics is referred to as JChem or JChemometrics rather than
JChemom, the citation will not necessarily be properly recognised by the search
program that comes with the Web of Science. The problem is even more serious
for the most frequently cited journal, Chemometrics and Intelligent Laboratory
Systems, which should be abbreviated (according to the ISI) as Chemometr. Intell.
Lab., but variants such as Chemom. Intell. Lab. Syst., Chemometrics Intell,
Chemometrics Intelli. and Chemo. Lab. (an often used nickname in the corridors)
all exist. To illustrate the seriousness of these problems created by
authors, the key publication by Appellof and Davidson (1981) had 1 miss out of
33; Geladi (1989) 4 out of 30; Sanchez and Kowalski (1990) 1 out of 54;
Smilde and Doornbos (1991) 1 out of 26; and Smilde (1992) 3 out of 28. An even more
damaging and curious situation exists in the ISI data base: for some (earlier?)
papers in the data base it is stated that they have no references, which means that
they cannot be included in this survey because, according to the data base, they do
not cite the key references. We have not checked this systematically, but at least
for the two chemometric journals mentioned above, papers published from 1987
through 1991 have been encountered with 0 references, including one of the key
publications (Smilde & Doornbos, 1991), which in fact contains 35 references.
More specifically, in our list of citing papers, there are no papers during the years
mentioned which cite the key references, an unlikely, and demonstrably incorrect,
state of affairs.
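
The manual checking that such variants require can be partly mechanised by mapping every known variant of a journal abbreviation onto a single canonical form before matching citations. The following is a minimal sketch of that idea, not the procedure used by the Web of Science search program; the function names are hypothetical, and the lookup table contains only the variants quoted above.

```python
import re

# Canonical ISI abbreviation and the variants quoted in the text.
CANONICAL = "Chemometr. Intell. Lab."
VARIANTS = [
    "Chemom. Intell. Lab. Syst.",
    "Chemometrics Intell",
    "Chemometrics Intelli.",
    "Chemo. Lab.",
]


def normalise(abbreviation: str) -> str:
    """Lower-case, replace anything that is not a letter by a space, collapse spaces."""
    letters_only = re.sub(r"[^a-z ]", " ", abbreviation.lower())
    return re.sub(r"\s+", " ", letters_only).strip()


# Lookup from normalised spelling to the canonical abbreviation.
LOOKUP = {normalise(name): CANONICAL for name in VARIANTS + [CANONICAL]}


def canonical_journal(abbreviation: str):
    """Return the canonical abbreviation, or None if the variant is unknown."""
    return LOOKUP.get(normalise(abbreviation))
```

Unknown spellings return None and would still have to be inspected by hand, so the sketch reduces the amount of manual checking rather than removing it.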



3. Results 

3.1 Introduction 

As mentioned above, the key publications pointed to 120 publications (key
publications included) in chemistry which refer in one way or another to three-
mode analysis. Due to space limitations the complete list of these papers, including
which papers they cite, will not be reproduced here, but may be requested from
the author. Different types of papers may quote three-mode analysis for different
reasons. They may build on the theoretical developments presented in such a
paper, they may be reviews of the state of the art, they may present an application
of a three-mode analysis, or they may merely indicate the existence of three-mode
methods without explicitly using them. Papers of the latter type do not really
make a contribution to the field, but they do signify an awareness of the
techniques.

3.2 Some basic statistics 

During the 11 years under consideration (1987-1997), the yearly numbers of
papers in our data base are: 2, 2, 2, 4, 4, 14, 13, 19, 22, 13, 23, but the numbers of
papers in the first five years are too low (see above). Even so, the number of papers
in three-mode analysis has been steadily increasing over the years. Of course, one
can hardly say that three-mode analysis is one of the basic tools for analysis, but
then again one also uses a spark plug spanner only once in a while for a specific
kind of job.

The major chemistry journals publishing new developments about three-mode
analysis are Chemometrics and Intelligent Laboratory Systems (27), Journal of
Chemometrics (26) and Analytical Chemistry (14), together accounting for 56% of the
publications.
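
The 56% figure follows directly from the three journal counts and the total of 120 publications reported earlier. The short check below reproduces that arithmetic; the variable names are chosen here for illustration.

```python
# Journal counts quoted in the text; 120 is the total number of publications.
journal_counts = {
    "Chemometrics and Intelligent Laboratory Systems": 27,
    "Journal of Chemometrics": 26,
    "Analytical Chemistry": 14,
}
total_publications = 120

# Combined share of the three journals, as a fraction and as a rounded percentage.
share = sum(journal_counts.values()) / total_publications
share_percent = round(100 * share)
```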

3.3 Research groups 

A sign of a relatively healthy state of affairs is that there is a large number of
research groups on both sides of the Atlantic working in the area, with hotbeds
at the University of Washington around Kowalski; the University of Amsterdam
around Smilde; the Royal Veterinary and Agricultural University, Frederiksberg,
Denmark, around Bro and several of his colleagues in other Scandinavian countries
such as Geladi; the University of Barcelona around Tauler; and McMaster University in
Canada around MacGregor and Nomikos. Other authors, like Booksh (now at the
University of South Carolina), Burdick (Duke University), Henrion (Humboldt
Universität Berlin) and Ross (formerly Ohio State University), have produced many
initiatives in the field. A special mention should be made of Bruce Kowalski who,
although never a first author of a three-mode paper, is co-author with many of the
leading players in the field. The attraction of his Center for Process Analytical
Chemistry (CPAC) has been such that several, often young, researchers such as
Faber, Smilde and Tauler have been guests at his institute, and partly because of
this he has been a major factor in the propagation, integration, and application of
three-mode analysis in chemistry. In this connection Paul Geladi should also be
singled out as one of the pioneers of three-mode analysis in chemistry, even
though this might not be so evident from the data in this study.

3.3 Role of the key publications 

The oldest three chemical key publications contain some references to each 
other, and all contain references to psychometric publications. The first paper to 
bring together all key publications was the review by Henrion (1994), while 
thereafter the only other paper to refer to all of them is Bro et al. (1997). One 
could therefore say that from roughly 1992 on, the integration of the three-mode 
field was nearing completion and all psychometric information was being used in 
chemistry. From then on more papers appear which use whatever references were 
necessary for their purposes. 

This does not mean that all research groups seemed to be equally aware of the
contributions from psychometrics. For instance, the group around MacGregor and
Nomikos does not refer to the psychometric papers at all; the main reason is probably that their
technique, multiway principal component analysis, does not really deal with three-
mode analysis if this is defined by requiring that there are separate parameters for
each of the three modes. It is also interesting that, apart from the paper by Sanchez
and Kowalski (1990), it took until 1994 for the first paper from the CPAC group to
mention work outside the chemometric tradition, and that they also mentioned work
from other chemometricians very sparingly, which is in a way strange given the
number of visitors that passed through their centre.

That the initial flow of ideas was from psychometrics to chemometrics is
supported by the fact that, searching through the Social Science Citation Index®,
only two publications (Harshman, 1994; Kroonenberg, 1994) outside the
chemistry sphere refer to the chemometric key publications. So far Kiers and
Harshman are the only psychometricians with three-mode publications in
chemometric journals (Kiers & Smilde, 1995; Kiers & Harshman, 1997).



3.4 Further work 

The above results have of course only a limited breadth and only describe the 
referencing structure without going into the subject matter at hand. The next step 
in this research is to read the abstracts and the papers themselves to link explicitly 
the specific methods developed by the various groups and to assess how the 
papers that are cited are used in the citing papers. This would also result in setting 
aside those papers which only mention three-mode analysis in passing. Moreover, 
it will be possible to pinpoint the specific innovative contributions which were 
generated within chemometrics, apart from borrowing from psychometrics, and it 
will be possible to show the (in part) autonomous development in chemometrics. 

Furthermore, an analysis of the subject areas that use three-mode methods 
should be made to establish which fields especially make use of three-mode 
analysis in their research, and to what extent there is a transfer from one part of 
chemistry to another. 

To produce a more encompassing view of the relations between research
groups and papers, a co-citation analysis would be very helpful, i.e. the complete
reference lists have to be checked to assess which papers are cited together in the
same paper. Thus comparisons should not only be made between the citations to
the key publications, but also between all publications among themselves. Such an
analysis might also turn up more articles which do not cite the key publications,
but only less prominent ones.
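
A co-citation count of the kind described here can be sketched in a few lines: for every citing paper, each unordered pair of references in its reference list is counted once. The example is illustrative only; the function name is hypothetical and the reference lists are invented, not taken from the data base of this study.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often two references appear together in the same citing paper."""
    counts = Counter()
    for references in reference_lists.values():
        # Sort so that each unordered pair always has the same key.
        for pair in combinations(sorted(set(references)), 2):
            counts[pair] += 1
    return counts


# Hypothetical citing papers and their reference lists (illustration only).
citing_papers = {
    "paper_1": ["Tucker 1966", "Harshman 1970", "Geladi 1989"],
    "paper_2": ["Harshman 1970", "Geladi 1989"],
    "paper_3": ["Tucker 1966", "Kroonenberg 1983b"],
}
counts = cocitation_counts(citing_papers)
```

Pairs with high counts point to publications that are habitually cited together, which is the raw material for grouping papers and research groups.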

In the future, we hope to report on such issues. Moreover, comparisons between
different disciplines are planned to establish to what extent three-mode
analysis is incorporated in different disciplines and to uncover (hopefully) reasons
for these differences.

Abstract: Model selection is often judged on only one criterion. We investigate
the application of Data Envelopment Analysis to this problem, and consider ways in
which this approach can utilize, as well as inform, existing statistical methodology.

1. Introduction

Various aspects of model selection have been studied in the statistical literature.
Classical methods using information criteria (for example, AIC and BIC
of Akaike (1974) and Schwarz's (1978) Bayesian criterion SBC) have been used
for model selection in time series analysis for some time. The issue of model
uncertainty is closely related to model selection, and this has recently been tackled
using Bayesian methods (Draper, 1995 and Madigan, Raftery, Volinsky & Hoeting,
1996). In this case, it may be that no one model is selected and some sort of
model average is used. Some of these methods rely on the models being nested,
as does a simple likelihood ratio test. More general examples of model selection
can be found in the review of Chatfield (1995); see also the recent work of Sclove
(1993, 1994).

Apart from the classical classification approaches like linear and quadratic
discriminant analysis, there are alternative approaches developed by Computer Science
researchers. Decision trees (Quinlan, 1993), Neural Networks and rule-based
methods (for example, CN2 developed by Clark and Niblett (1989)) are well-known
examples. Although there have been developments in new classification
methods, research into the selection of metrics that can consider all aspects of
the classification models has been neglected. As a criterion for selecting between
supervised classification methods, the accuracy rate is often the only arbiter;
see Michie, Spiegelhalter & Taylor (MST, 1994). The main aim of this paper
is to contribute to this debate by generalizing the model definition to cover all
alternative approaches mentioned above and to suggest a multi-criteria metric
for selecting amongst the alternative classification models that takes into account
various aspects (positive and negative properties) of the models. For this purpose
we have developed multi-criteria metrics based on Data Envelopment Analysis
(DEA).

2. Definition of the "Classification Model"

In this paper we define the classification model as the result of a process which
uses the objects in a certain domain (for example, Diagnostics, Risk Management,
Image Processing). The objects are characterized by some attributes and
are assigned to one of the known classes. The output of this process is a classifier
(model) that can be used to find, among other things, the class-value of unseen objects.
For example, in the case of discriminant analysis this classifier is characterized by
a discriminant function. In the case of neural networks it is characterized by the
weights of the learned network and in the case of decision trees by the different
nodes and leaves of the tree involving attribute subsets. Rule-based classifiers
(models) are characterized by "if-then rules" that also involve various attribute
subsets.

The classification models that we are going to evaluate and select amongst
belong to the above categories. They can also be developed by combining
two or more of these approaches, which is known as Multi-Strategy model
building. We will discuss various positive and negative properties of such models
to construct multi-criteria based evaluation metrics. These will be presented in
the next section.



3. Various Properties of the Classification Models and the DEA-approach

A survey of the literature on model selection shows that, until now, in most cases
the selection has been based only on one criterion, namely the accuracy rate. This
means that most of the available model selection methods in the literature are
mono-criterion approaches. In our approach, however, we regard a classification
model as a product (for example, a car) or a production unit (for example, a car
factory) that can have different properties (Nakhaeizadeh and Schnabl, 1997). Some
of these properties are positive and some negative. The positive properties of
classification models include the validity, understandability and usability of the
results. Complexity, extensive computing time and misclassification costs can be
considered as negative properties of the classification model.

In evaluating products (for example, cars) a buyer considers all their positive
and negative properties. If we regard classification models as products as well,
we should evaluate these models in the same manner. This means that we should
use multi-criteria metrics for model selection that are able to take into account all
positive and negative properties of the models. This allows us to make a fairer
evaluation of, and selection between, the alternative classification models.

DEA, which was developed in Operations Research and is used to select between
alternative products (Charnes, Cooper & Rhodes, 1978; Charnes, Cooper,
Lewin & Seiford, 1996), is an appropriate methodology to compare the alternative
classification models using their positive and negative properties (Nakhaeizadeh
and Schnabl, 1997). It also allows consideration of the background knowledge
and preferences of the user. For example, it can be that for one user the accuracy
rate of the classification model is more important than the understandability of the
results. In DEA, it is possible to consider such preferences. Originally, DEA tries
to develop a ranking system for Decision Making Units (DMUs). In our case each
DMU is a classification model. In DEA terminology, positive properties are called
output components and negative properties input components. In our terminology,
a typical example of an output component is the accuracy rate of the classification
model. A typical input component is the computation time that the classification
model needs for training. Generally, output components are all components
where higher values are better and input components are those where lower values
are better. Using these components, we can now define the efficiency of a
classification model as follows:

$$ \text{efficiency} = \frac{\text{sum of weighted output components}}{\text{sum of weighted input components}} $$

As the above relation shows, the definition of efficiency covers all positive and
negative characteristics of a classification model and efficiency in this form can be
regarded as a multi-criteria based metric that can be used for the evaluation and
selection between the alternative classification models. In DEA the weights are
determined for each classification model individually during the computation by
maximizing the efficiency. DEA chooses the weights so that the efficiency of the
classification model being evaluated is as close to 100% as possible. At the same
time, the other classification models should not have an efficiency of more than 100%
using the same set of weights.
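
To make the ratio concrete, the sketch below computes the weighted efficiency for given weights and, for the special case of one output and one input component, the resulting DEA scores. In that special case no linear programming solver is needed, because the best weights simply rescale the ratios so that the largest equals 100%; with several components a solver would be required. The function names and the three models with their values are hypothetical and serve only as an illustration.

```python
from typing import Dict, Sequence, Tuple


def efficiency(outputs: Sequence[float], inputs: Sequence[float],
               u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of weighted output components divided by sum of weighted input components."""
    weighted_outputs = sum(weight * y for weight, y in zip(u, outputs))
    weighted_inputs = sum(weight * x for weight, x in zip(v, inputs))
    return weighted_outputs / weighted_inputs


def dea_single_ratio(models: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """DEA scores when each model has exactly one output y and one input x."""
    # The model with the largest ratio y/x is efficient (score 1.0, i.e. 100%).
    best_ratio = max(y / x for y, x in models.values())
    return {name: (y / x) / best_ratio for name, (y, x) in models.items()}


# Hypothetical models: (accuracy rate, training time in seconds).
models = {"A": (0.90, 30.0), "B": (0.80, 10.0), "C": (0.95, 60.0)}
scores = dea_single_ratio(models)
```

In this invented example model B is the efficient one, even though model C has the highest accuracy rate, because B buys its accuracy with far less training time.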

Obviously this is an optimization problem that can be transformed to a linear
program. If the classification model which is under evaluation does not achieve
the given threshold (100%), then it is not efficient and there is at least one model
amongst the others which dominates it. There are different ways of setting the
weights in DEA. The most commonly used ones are input-oriented and
output-oriented optimization (Ali and Seiford, 1993). The goal in the input-oriented
optimization approach is to reduce radially the input-levels as much as possible.
Radially means that all component-values are changed by the same proportion.
Conversely, in the output-oriented optimization approach the main goal is to enlarge
radially the output-levels as much as possible. The ranking of the alternative
classification models can be done by using their efficiency.
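
For reference, the transformation to a linear program can be written out as follows. This is a sketch of the standard ratio model of Charnes, Cooper and Rhodes (1978) in its linearized form; the notation is introduced here for illustration and does not appear in the original text: $y_{rj}$ is output component $r$ of model $j$, $x_{ij}$ is input component $i$ of model $j$, $u_r$ and $v_i$ are the weights, and index $0$ denotes the model under evaluation.

```latex
\begin{aligned}
\max_{u,v}\quad & \sum_{r} u_r\, y_{r0} \\
\text{subject to}\quad & \sum_{i} v_i\, x_{i0} = 1, \\
& \sum_{r} u_r\, y_{rj} - \sum_{i} v_i\, x_{ij} \le 0 \quad \text{for every model } j, \\
& u_r \ge 0, \quad v_i \ge 0 .
\end{aligned}
```

Fixing the weighted input sum of the evaluated model at 1 turns the efficiency ratio into a linear objective, and the inequality constraints state that no model may exceed 100% under the same weights. In super-efficiency variants the constraint for the evaluated model itself is dropped, so that efficient models can score above 100% and can be ranked among themselves.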

4. Empirical Results and Conclusions

Michie et al. (1994) describe the performance of 23 classification algorithms on
22 different domains. To rank different algorithms, when applied to a certain
domain, they use only the accuracy rate for the test dataset although they have data
about maximum computer storage, training and testing time, training and testing
error rates or training and testing misclassification costs (where costs are
applicable). To obtain a DEA-based ranking for the classification algorithms applied by
MST we have used the input and output-oriented versions of DEA. We illustrate
the methodology for the Credit Management dataset reported by MST (section
9.2.1). The dataset was donated by a major British engineering company, and the
two classes can be interpreted as the method by which the debts will be retrieved;
the class values were determined by an expert on the basis of the given attributes.
There were 7 attributes (all of which were numerical) and a total of 20 000
observations, of which 15 000 were used as the training data and 5 000 were used as the
test data.

The Andersen and Petersen (1993) model was used with three input components
(max. storage, training time and testing time) and one output component
(error rate for the test data). As an alternative, we have also used a variant with an
additional output component (accuracy rate for training data). Input-oriented
versions are denoted by 4I (one output and three input components) and by 5I (two
output and three input components). The output-oriented versions are similarly
denoted by 4O and 5O. The DEA-ranking results, as well as the ranked performance
for the individual components, are given in Table 1.

For comparative purposes we carried out a cluster analysis of the components.
Since there are three units of measurement (storage, time and error rate), we
ensured that each component was treated equally by using the ranked performances,
which are shown in the first five columns of Table 1. This differs from the
DEA-models, in which the raw units were used and in which the weighting for each
component was chosen optimally. The tree structure and the first two principal
components are shown in Figure 1, and the interpretation clearly differs from the
DEA-models. However, it can be noted that the efficient algorithms (as defined
by the DEA-models) have (ranked) observations which lie closer to the origin (in
a suitable metric) than any other algorithm. This can be partially seen in the PC
plot of Figure 1 (which is only two-dimensional), but is more clearly visualized in a
brush-and-spin plot.
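
Replacing raw measurements by ranks puts storage, time and error rate on a common scale. The sketch below shows one common way of doing this, in which tied values share the average of the ranks they occupy, which would explain half-integer entries such as 13.5 in Table 1. Whether the original ranks were computed in exactly this way is an assumption; the function name is hypothetical.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) to largest; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the block [start, end] over all entries tied with order[start].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


# Hypothetical training error rates for four algorithms (illustration only).
error_rates = [0.1, 0.3, 0.3, 0.2]
ranks = average_ranks(error_rates)
```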

We conclude that DEA, and perhaps other multi-criteria based metrics that
can take into account all positive and negative criteria of the classification model,
can lead to a fairer selection between the alternative algorithms. Such
approaches open a new perspective to the statistical community in revisiting the
general problem of model selection.

Table 1: Ranking of algorithms for the Credit Management data set using various
criteria and different DEA-models. The column labelled "te.err" is the ranking
of MST, based on the error rate for the test data. The figures in italics mean that
the algorithm is efficient for that DEA-model.

| Algorithm | Store | tr.tim | te.tim | tr.err | te.err | 5I | 4I | 5O | 4O |
|---|---|---|---|---|---|---|---|---|---|
| Discrim | 2 | 2 | 6 | 13.5 | 13 | 3 | 3 | 9 | 7 |
| Quadisc | 3 | 4 | 10 | 18.5 | 21 | 14 | 12 | 18 | 17 |
| Logdisc | 14 | 5 | 11 | 13.5 | 8 | 16 | 13 | 15 | 12 |
| SMART | 11 | 18 | 7 | 10.0 | 1 | 2 | 2 | 5 | 5 |
| k-NN | 5 | 19 | 18 | 12.0 | 22 | 15 | 14 | 19 | 19 |
| Castle | 1 | 7 | 15 | 18.5 | 19 | 6 | 9 | 9 | 7 |
| IndCART | 17 | 8 | 17 | 5.0 | 6 | 17 | 17 | 14 | 13 |
| NewID | 4 | 14 | 2 | 2.0 | 13 | 1 | 8 | 9 | 7 |
| AC2 | 19 | 16 | 19 | 2.0 | 8 | 9 | 19 | 7 | 16 |
| Baytree | 16 | 3 | 5 | 4.0 | 7 | 9 | 5 | 1 | 1 |
| NaiveBay | 15 | 1 | 3 | 16.5 | 16 | 7 | 10 | 9 | 7 |
| CN2 | 18 | 13 | 9 | 2.0 | 12 | 9 | 16 | 8 | 15 |
| C4.5 | 13 | 6 | 16 | 6.0 | 3 | 9 | 4 | 4 | 4 |
| ITrule | 10 | 15 | 1 | 16.5 | 18 | 8 | 11 | 9 | 7 |
| Cal5 | 7 | 10 | 8 | 7.0 | 4 | 4 | 6 | 2 | 2 |
| DIPOL92 | 8 | 12 | 14 | 8.5 | 1 | 9 | 1 | 6 | 6 |
| Backprop | 6 | 17 | 4 | 8.5 | 4 | 5 | 7 | 3 | 3 |
| RBF | 9 | 9 | 12 | 15.0 | 10 | 18 | 15 | 16 | 14 |
| LVQ | 12 | 11 | 13 | 11.0 | 15 | 19 | 18 | 17 | 18 |

Figure 1: Dendrogram and Principal Component plot of the ranked performance
for the credit management dataset (using the first 5 columns of Table 1).

Abstract: When applying STATIS¹, matrices S of HILBERT-SCHMIDT products
are derived. These products express the relations between the studies that are
considered. It is shown that, when there is a common structure between the
studies, S may be represented as the sum of a rank one matrix and an error
matrix. Tests for this hypothesis are presented.

Keywords: STATIS, Matrix Rank, F Tests, Hilbert-Schmidt Products, Hidden 
Periods. 



1. INTRODUCTION 

When using the STATIS method, either in the direct or in the dual mode, at
the end of the first phase (inter-structure), we have a matrix S of HILBERT-
SCHMIDT products between the operators representing the studies. Orthogonal
projections of the points representing the operators on the subspace generated
by the first two eigenvectors of S are carried out. When there is a common
structure between the studies these projections appear grouped near the first
axis. This indicates that a matrix with rank one can then approximate the matrix
S. We will show how to use this observation to test the existence of a common
structure.

2. INFERENCE

2.1 MODELS 

Let S be the matrix of HILBERT-SCHMIDT products. We now follow
FISHER'S lead in writing a model for these matrices:

$$ S = \sum_{j=1}^{r} \lambda_j d_j d_j^t + E \tag{1} $$

where the $d_1, \dots, d_r$ are fixed [and mutually orthogonal with norm
one], the elements of $E$ being iid, normally distributed with null mean value and
variance $\sigma^2$.

¹ STATIS: Structuration des Tableaux à Trois Indices de la Statistique.

2.2 F TESTS 

When there is inter-structure, S will be the sum of a rank one matrix
and an error matrix. We then have r=1, and

$$ S = \lambda_1 d_1 d_1^t + E \tag{2} $$

If the fixed component in S is sufficiently dominant, the first eigenvector of S is
a good approximation for $d_1$. If the remaining eigenvectors of S are the line
vectors of matrix K, we then will have

$$ KS = KE \tag{3} $$

(at least approximately). Let us now establish 

Proposition 1 

The elements of $E' = KE$ are iid normally distributed with null mean
value and variance $\sigma^2$.

Proof.

If $E = [e_1 \cdots e_k]$ we have $E' = [e'_1 \cdots e'_k]$ with $e'_j = K e_j$, j=1,...,k. Since
the elements of E are independent, the vectors $e_j$, j=1,...,k, are independent
between themselves, as well as the $e'_j$, j=1,...,k. Besides this, see SEBER
(1980, pg. 5), the vectors $e'_j$ will be normal with null mean vector and
variance-covariance matrix $\sigma^2 I_{k-1} = K(\sigma^2 I_k)K^t$. Thus the elements of $E'$,
which are the components of these vectors, will be independent even if they are
in the same column.

We have also shown that they are normally distributed with null mean values and
variance $\sigma^2$, and so the proof is complete.

We now use this result to derive a test for 

$$H_0 : \operatorname{rank}(S) \le 1 \qquad (4)$$

As we saw, this hypothesis leads to 

$$Z = KS = E' \qquad (5)$$

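So, when $H_0$ holds, the elements in KS will be iid $N(0, \sigma^2)$. We can test 
this consequence. With $Z = [z_{ij}]$, $g = (k-1)k$, 

$$D = \sum_{i=1}^{k-1} \sum_{j=1}^{k} z_{ij}^2 \quad \text{and} \quad T = \sum_{i=1}^{k-1} \sum_{j=1}^{k} z_{ij}, \qquad (6)$$

when the $z_{ij}$, $i = 1, \dots, k-1$, $j = 1, \dots, k$, are iid $N(0, \sigma^2)$, the statistic 

$$\Im = \frac{(g-1)\,T^2}{gD - T^2} \qquad (7)$$

has, see MEXIA (1989), a central F distribution with 1 and g-1 degrees of 
freedom. Using the acceptance region $[f_{\alpha/2,1,g-1};\ f_{1-\alpha/2,1,g-1}]$ we get, 
see MEXIA (1989), a quite powerful $\alpha$ level test for $H_0$. 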

After testing $\operatorname{rank}(S) \le 1$ we may want to test 

$$H_0 : \lambda_1 = 0 \qquad (8)$$

To do this, we obtain the sum $S'$ of the squares of the components of $d_1^t S$, so 
that we can compute 

$$\Im' = (k-1)\,\frac{S'}{D} \sim F(\,\cdot \mid k, g) \qquad (9)$$

We now establish 

Proposition 2 

When $H_0$ holds, $\Im'$ has the central F distribution with k and g degrees 
of freedom. 

Proof. 

When $\lambda_1 = 0$ we have S = E and, with P the orthogonal matrix with row 
vectors $d_1^t, \dots, d_k^t$, we can reason as in the proof of Proposition 1 to show that 
the elements of $E^{\circ} = PS = PE$ are iid $N(0, \sigma^2)$. Thus the sum $S'$ [$D$] of 
squares of the elements in the first row [the other rows] of $E^{\circ}$ will be the 
product by $\sigma^2$ of a central chi-square with k [g] degrees of freedom. Since the 
elements of $E^{\circ}$ are independent, both chi-squares are independent and the 
thesis follows. 

We can now use $\Im'$ to test $H_0$. 
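The two statistics of this section can be sketched numerically as follows. This is only an illustration under stated assumptions: the function name `rank_one_statistics` is hypothetical, S is taken to be a symmetric matrix so that `numpy.linalg.eigh` applies, and the statistics are written as $(g-1)T^2/(gD - T^2)$ and $(k-1)S'/D$, which are the forms that give the F(1, g-1) and F(k, g) distributions stated in the text.

```python
import numpy as np

def rank_one_statistics(S):
    """Return the two F-type statistics for a symmetric k x k matrix S.

    Assumption: S is symmetric, so its eigenvectors are orthonormal.
    """
    S = np.asarray(S, dtype=float)
    k = S.shape[0]
    eigvals, eigvecs = np.linalg.eigh(S)
    # Sort eigenvectors by decreasing absolute eigenvalue.
    order = np.argsort(np.abs(eigvals))[::-1]
    V = eigvecs[:, order]
    d1 = V[:, 0]          # approximation of the dominant direction d_1
    K = V[:, 1:].T        # remaining eigenvectors as rows, shape (k-1, k)

    Z = K @ S             # under rank(S) <= 1, Z is (approximately) pure error
    g = (k - 1) * k
    D = np.sum(Z ** 2)
    T = np.sum(Z)

    # Statistic for H0: rank(S) <= 1, referred to F(1, g-1).
    stat_rank = (g - 1) * T ** 2 / (g * D - T ** 2)

    # Statistic for H0: lambda_1 = 0, referred to F(k, g).
    S_prime = np.sum((d1 @ S) ** 2)
    stat_lambda = (k - 1) * S_prime / D
    return stat_rank, stat_lambda
```

The first value would be compared with quantiles of F(1, g-1) and the second with quantiles of F(k, g); the quantiles themselves are not computed here.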

2.3 TIME-SERIES 

When, after applying both tests, we are led to assume that rank(S) = 1, 
we are implicitly accepting that one parameter is enough to “locate” the studies. 

The values of that parameter will be the components of $d_1$, since then we 
have 

$$S \approx \lambda_1 d_1 d_1^t \qquad (10)$$

If the different studies are taken at successive moments, these 
components constitute a time-series. It may be worthwhile to analyse it in order 
to study the global evolution of the studies. 



3. AN APPLICATION 

In this application we used the data on AIDS in Portugal from 1984 to 
1997. This data was organised by regions, which were the objects. The 
variables that were considered were disease and transmission classes, sex, age 
group, and virus type. 

Using this data we obtained 

Table 1: Hilbert-Schmidt products 

The corresponding eigenvalues and first eight eigenvectors (Table 2 and Table 3) 

Table 2: Eigenvalues of the Hilbert-Schmidt products 

  j     Eigenvalue 
  1     1904.73 
  2      289.11 
  3      182.51 
  4      104.00 
  5       85.34 
  6       57.79 
  7       39.65 
  8       33.33 
  9       32.96 
 10       28.95 
 11       26.31 
 12       23.09 
 13       20.73 
 14       14.84 




Table 3: Eigenvectors of the Hilbert-Schmidt products 



El 




0.337 


0.430 


0.332 


0.220 


0.233 


0.180 


0.162 


0.230 


0.126 


0.283 


0.267 


0.346 


0.116 


m 


0.006 


- 0.146 


- 0.330 


- 0.101 


0.158 


0.105 


0.224 


0.322 


0.078 


0.384 






- 0.196 


0.685 


m 


0.022 


0.136 




0.014 


- 0.191 


- 0.157 




- 0.028 


- 0.205 


■ 0.607 


0.081 


- 0.081 


0.130 


0.666 




1^^ 


- 0.168 




- 0.245 


• 0.020 


0.156 


0.302 




0.025 


- 0.415 


0.101 


0.225 


- 0.199 


- 0.135 


m 




- 0.111 


0.086 


- 0.074 


- 0.381 


- 0.139 


- 0.423 


- 0.403 


0.186 


0.434 


0.069 


0.069 


0.294 


- 0.021 




0.026 






0.106 


0.315 


- 0.430 


0.245 


- 0.199 


- 0.539 


0.170 


- 0.186 






0.021 




- 0.226 


- 0.284 


- 0.341 


- 0.237 


- 0.024 


0.189 


- 0.108 


- 0.045 






0.576 


0.068 


0.566 


0.010 



Using this data we obtained $\Im$ = 0.563 for the first test, which is 
non-significant, as well as $\Im'$ = 131.028 for the second, which is highly 
significant. We thus accept that r = 1. 

Since the studies were taken for successive years and may be located by 
a single parameter we carry out the analysis of the corresponding time-series. 
The values of this parameter were: 



Coord, first 

And the chronogram was 

Figure 1: Chronogram 

This chronogram suggested using the hidden periodicity model, see 
GRENANDER & ROSENBLATT (1975, pg. 91 to 94). We thus used the last 
13 years to adjust the model 



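$$y_t = 10.94571 - 3.34127\cos\left(\frac{2\pi t}{13}\right) - 2.366508\sin\left(\frac{2\pi t}{13}\right)$$

$$\qquad - 3.340611\cos\left(\frac{6\pi t}{13}\right) + 0.322584\sin\left(\frac{6\pi t}{13}\right)$$

$$\qquad + 3.735936\cos\left(\frac{12\pi t}{13}\right) + 1.585178\sin\left(\frac{12\pi t}{13}\right)$$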



In the following graph we placed the adjusted values as abscissas and the 
observed values as ordinates: 




This graph shows an acceptable adjustment; we are thus led to assume 
the existence of cyclic components with periods 13, 4.33 and 2.16. 
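As a minimal sketch, the adjusted model can be evaluated as below. The coefficients are those printed in the model; the harmonic numbers 1, 3 and 6 of the base period 13 are an assumption inferred from the stated periods 13, 4.33 (13/3) and 2.16 (13/6), and the names `HARMONICS` and `fitted_value` are hypothetical.

```python
import math

# Constant term and (harmonic, cosine coefficient, sine coefficient) triples.
# The harmonic numbers are an assumption inferred from the stated periods.
MEAN = 10.94571
HARMONICS = [
    (1, -3.34127, -2.366508),
    (3, -3.340611, 0.322584),
    (6, 3.735936, 1.585178),
]

def fitted_value(t, base_period=13):
    """Evaluate the hidden-periodicity model at time index t."""
    y = MEAN
    for h, a, b in HARMONICS:
        angle = 2.0 * math.pi * h * t / base_period
        y += a * math.cos(angle) + b * math.sin(angle)
    return y

# Adjusted values for 13 consecutive time indices.
adjusted = [fitted_value(t) for t in range(1, 14)]
```

Because every component is a harmonic of the base period, the adjusted series repeats exactly every 13 time steps.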

As a parting comment we point out that the chronogram we obtained is much 
easier to analyse than the usual two-dimensional projections. 

Abstract: 3-way data clustering was applied to the analysis of the numbers of 
CD4+ and CD8+ cells obtained from Japanese hemophiliacs infected with HIV-1 
through non-heat-treated clotting factor concentrates. A total of 131 
hemophiliacs were classified into four clusters, termed Cluster 1, 2, 3 and 4. 
The members of Clusters 1 and 2 showed an almost continuous decline in the 
number of CD4+ cells, while the members of Clusters 3 and 4 showed a less clear 
declining tendency. The number of CD8+ cells was declining in Clusters 1, 2 
and 3, while it was rising in Cluster 4. Nevertheless, the cumulative onset rates 
of AIDS in the four clusters could not be divided into four corresponding 
groups. The rate was the highest in Cluster 1, while no marked differences were 
found among the other three clusters. We could therefore identify a large 
variety in the time courses of CD4+ and CD8+ cell numbers during the AIDS 
incubation period, even when the onset rates were almost identical. These 
findings may help to find clues for preventing the onset of AIDS and for 
selecting appropriate therapy for HIV-1 infected patients. 

Key words: 3-Way Data, Fuzzy Clustering, AIDS, Hemophilia. 



1. Introduction 

The 3-way data clustering can analyze data with a 3-way structure composed of 
objects, attributes and situations (Sato & Sato, 1994). A proper application of 
3-way data clustering to time-series clinical data should therefore be an 
effective form of data mining, because such data are composed of patients 
(objects), measured values (attributes) and dates of measurement (situations). 
We applied the 3-way data clustering to the analysis of time series of 
lymphocyte subset numbers in a cohort of Japanese hemophiliacs infected with 
human immunodeficiency virus type 1 (HIV-1). Subsequently we evaluated the 
results of the classification by referring to the cumulative onset rates of 
acquired immunodeficiency syndrome (AIDS) in this cohort. 



2. Data Description and Classification Method 

The Research Committee on Prevention of Developing Illnesses and Therapy for 
HIV Infected Patients conducted pathological status surveys of HIV-1 
infected people in Japan (Yamada, 1992) from 1987 to 1996. In particular, its 
Natural History Committee (NHC) collected survey data from various areas 
of Japan and constructed a database. The numbers of CD4+ and CD8+ cells 
after 1985 were reported every half year: once for the first term (end of 
June) and once for the second (end of December). Among those data, sets 
of 13 values of CD4+ and CD8+ cell numbers collected during the six-year 
period from the first term of 1986 to the first term of 1992 were submitted to 
the program for 3-way data clustering developed by Sato and Sato (1994). 

We dealt with the data on CD4+ and CD8+ cells as three-way data with three 
suffixes as follows: 



$$x_{ia}^{(t)}, \qquad i = 1, 2, \dots, 131;\quad a = 1, 2;\quad t = 1, 2, \dots, 13 \qquad (1)$$

where i is a suffix for patients, and a for CD4+ and CD8+ cells. The suffix t 
in parentheses in Eqn. 1 indicates that it is the t-th measurement. 

The degree of belonging of the i-th patient to the k-th cluster, $u_{ik}$, which is 
the value of the membership function of the k-th of K fuzzy subsets, was 
obtained as the optimum that minimizes the sum of extended within-class 
dispersions, J(U,v), in which the weights are strictly positive. The solutions U 
and v that minimise J(U,v) depend on a parameter m that determines the degree 
of fuzziness of the clusters. The explicit form of J(U,v) and of the minimising 
solutions is that of the method of Sato and Sato (1994). 
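The kind of computation involved can be illustrated with a generic weighted fuzzy c-means over the measurement times. This is a sketch under loud assumptions, not the formulation of Sato and Sato (1994): the objective is taken to be a weighted sum over times of the usual fuzzy within-class dispersion with squared Euclidean distances, and the function name `fuzzy_three_way`, the equal default weights and the fixed iteration count are all hypothetical choices.

```python
import numpy as np

def fuzzy_three_way(X, K, m=2.0, weights=None, n_iter=100, seed=0):
    """Generic weighted fuzzy c-means for data X[object, attribute, time].

    Assumed objective: sum_t w_t sum_i sum_k u_ik^m ||x_i^(t) - v_k^(t)||^2,
    with memberships u_ik summing to one over the clusters k.
    """
    X = np.asarray(X, dtype=float)
    n, _, T = X.shape
    w = np.ones(T) / T if weights is None else np.asarray(weights, dtype=float)

    rng = np.random.default_rng(seed)
    U = rng.random((n, K))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        Um = U ** m
        # Centroid of cluster k at each time: membership-weighted mean.
        V = np.einsum('ik,iat->kat', Um, X) / Um.sum(axis=0)[:, None, None]
        # Weighted squared distance of object i to the trajectory of cluster k.
        diff = X[:, None, :, :] - V[None, :, :, :]
        d2 = np.einsum('ikat,t->ik', diff ** 2, w)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, V
```

With 131 patients, 2 attributes and 13 measurements, `X` would have shape (131, 2, 13), and each row of the returned `U` would hold one patient's degrees of belonging to the K clusters.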



3. Four Clusters 

Time-series data of CD4+ and CD8+ cell counts were available for a total of 
131 hemophiliacs among the 953 patients registered in the NHC database. We 
classified these 131 patients into four clusters, termed Cluster 1, 2, 3 and 4. 
The number of clusters was determined from the goodness of fit between the 
model and the observations. The number of patients included, n, and the 
average degree of belonging in each cluster are shown in Table 1. The largest 
cluster was Cluster 2 (n=49) and the smallest was Cluster 4 (n=13). 



Table 1: Number of patients in each cluster, n, and degree of belonging 
(Average ± SD) to own cluster 

Cluster    n (%)         Degree of belonging 
1          36 (27.5%)    0.923 ± 0.123 
2          49 (37.4%)    0.850 ± 0.143 
3          33 (25.2%)    0.839 ± 0.163 
4          13 (9.9%)     0.785 ± 0.213 



The motion of the centroids of the four clusters is shown in Fig. 1. The tendency 
of steady decline in CD4+ cell numbers was marked in Cluster 1. A gradual 
decline was also observed in Cluster 2, although it was slower than in 
Cluster 1. In contrast, the decline was unclear in Cluster 3 and Cluster 4. 
Moreover, most of the reported numbers of CD4+ cells in Clusters 3 and 4 were 
higher than 500 × 10^6/L. The number of CD8+ cells was steadily declining in 
Cluster 1. A slower decline was shown in Clusters 2 and 3. On the other hand, 
the number of CD8+ cells in Cluster 4 was rising, with a gradient of 90.9 
(95% CI, -5.1 to 186.8) × 10^6/L per year. 

Fig. 1: Motion of centroids of four clusters 



4. Aspects of Each Cluster 

Cumulative onset rates of AIDS since the beginning of 1986 in each cluster are 
shown in Fig. 2. The Kaplan-Meier estimate (Kaplan & Meier, 1958) of the 
fraction of patients with AIDS was the highest in Cluster 1. In the other 
clusters, only a small part of the patients (less than 11%) had developed 
AIDS by the end of March 1994. 
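For readers unfamiliar with the estimator, a cumulative onset rate of this kind can be sketched as one minus the Kaplan-Meier survival estimate. The function name `kaplan_meier_onset` and the toy follow-up data in the usage note are hypothetical and are not taken from the NHC database.

```python
def kaplan_meier_onset(times, events):
    """Return [(time, cumulative onset rate)] at each distinct onset time.

    times  : follow-up time of each patient
    events : 1 if onset was observed at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        onsets = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            onsets += data[i][1]
            leaving += 1
            i += 1
        if onsets:
            survival *= 1.0 - onsets / n_at_risk
            curve.append((t, 1.0 - survival))
        n_at_risk -= leaving
    return curve
```

For example, `kaplan_meier_onset([2, 3, 3, 5, 8], [1, 1, 0, 0, 1])` gives steps at times 2, 3 and 8, with censored patients leaving the risk set without adding a step.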

Fig. 2: Cumulative onset rates of AIDS in four clusters 

Numbers of cases with clinical disorders in the four clusters are shown in Table 2. 
Pneumocystis carinii pneumonia, one of the AIDS-defining clinical diseases, 
was observed only among patients in Cluster 1. Although no onset of clinical 
AIDS was reported in Cluster 3, herpes zoster and other symptoms were 
observed among its patients. 

The shapes of the age distributions at the first term of 1986 in Clusters 2, 3 
and 4 were almost the same. Although a broader range and a lower median were 
observed in Cluster 1, no statistically significant difference in age 
distribution was found among the four clusters. 

Table 2: Reported number of cases with clinical disorders in the four clusters 

Disorders                         Cluster 1     Cluster 2     Cluster 3     Cluster 4 
Herpes zoster                     9 (25.0%)     6 (12.2%)     3 (9.1%)      2 (15.4%) 
Mycosis                           7 (19.4%)     3 (6.1%)      0             0 
Pneumocystis carinii pneumonia    6 (16.7%)     0             0             0 
Intractable diarrhea              5 (13.9%)     3 (6.1%)      0             1 (7.7%) 
Others                            14 (38.9%)    8 (16.3%)     6 (18.2%)     2 (15.4%) 


5. Discussion 

The number of CD4+ cells has been considered the most informative factor 
for indicating the disease status (Young et al., 1997). Therefore, a proper 
classification of the time courses of CD4+ cell numbers during AIDS incubation 
is valuable for evaluating the natural history of HIV. Using the slope of 
changes in CD4+ cell count from linear regression is one of the practical 
methods adopted in some of the definitions of long-term nonprogressors 
(Strathdee et al., 1996). However, a more appropriate method is needed to 
capture the principal characteristics of the time courses of CD4+ cell numbers. 
The present application of 3-way data clustering is one possible solution to 
this problem. 

The 131 hemophiliacs can be classified into four clusters by the characteristics 
shown in Fig. 1. Nevertheless, the cumulative onset rates of AIDS in the four 
clusters cannot be divided into four groups (Fig. 2). The rate is the highest in 
Cluster 1, while no marked differences are found among the other three clusters. 
Moreover, the clinical symptoms summarized in Table 2 also indicate the worst 
conditions for patients in Cluster 1. No remarkable difference is shown among 
the other clusters. Therefore, the present results of clustering seem to be quite 
meaningful, because we could identify a large variety in the time courses of 
CD4+ and CD8+ cell numbers during the AIDS incubation period even when 
the onset rates were almost identical. For example, the number of CD8+ cells 
has been rising in Cluster 4, in addition to having the highest average among 
the clusters at the initial point. Thus the patients in Cluster 4 may manifest a 
more dynamic cytotoxic T-lymphocyte response than those in other clusters, 
although this requires further investigation to confirm. Identification of such 
patients may help to find clues for preventing the onset of AIDS and for 
selecting appropriate therapy for HIV-1 infected patients. 

Older age at infection with HIV-1 has been reported to be tightly associated with 
more rapid AIDS progression in hemophiliac cohorts (Rosenberg et al., 1994; 
Munoz & Xu, 1996). The individual age at the time of infection has not been 
revealed by NHC. However, a tentative mean date of seroconversion has been 
proposed by the Committee as April 1983 (Mimaya et al., 1992). Therefore the 
first term of 1986 can be considered to be approximately 3 years after 
seroconversion. In the present study, no statistical difference in age at the first 
term of 1986 was found among the four clusters. Thus the age at infection does 
not explain the differences in CD4+ cell time courses and AIDS onset rates 
among clusters. Therefore, further investigation is necessary to detect other 
background information that resulted in the present classification. The most 
important of all would probably be the viral loads of patients in each cluster. 



References 

Ferligoj, A. & Batagelj, V. (1994). Direct multicriteria clustering. Journal of 
Classification, 9, 43-61. 

Kaplan, E.L. & Meier, P. (1958). Nonparametric estimation from incomplete 
observations. Journal of the American Statistical Association, 53, 457-481. 

Mimaya, J., Meguro, T., Tatsunami, S., et al. (1992). Natural history committee 
report, in: Annual Report of Research Committee on Prevention of 
Developing Illnesses and Therapy for HIV Infected Patients 1991, Yamada, 
K. (Ed.), 9-16. 

Munoz, A. & Xu, J. (1996). Models for the incubation of AIDS and variations 
according to age and period. Statistics in Medicine, 15, 2459-2473. 

Rosenberg, P.S., Goedert, J.J. & Bigger, R.J. (1994). Effect of age at 
seroconversion on the natural AIDS incubation distribution. AIDS, 8, 803-810. 

Sato, M. & Sato, Y. (1994). On a multicriteria fuzzy clustering method for 
3-way data. International Journal of Uncertainty, Fuzziness and 
Knowledge-Based Systems, 2, 127-142. 

Strathdee, S.A., Veugelers, P.J., Page-Shafer, K.A., et al. (1996). Lack of 
consistency between five definitions of nonprogression in cohorts of 
HIV-infected seroconverters. AIDS, 10, 959-965. 

Yamada, K. (1992). Pathological status and therapy of HIV-infected 
hemophiliacs in Japan. Southeast Asian Journal of Tropical Medicine and 
Public Health, 23 (Suppl 2), 127-130. 

Young, F.H.L., Taylor, J.M.G., Bryant, J.L., et al. (1997). Dependence of the 
hazard of AIDS on markers. AIDS, 11, 217-228. 




PART VI 



Case Studies 



• Applied Classification and Data Analysis 





On an Error Detecting Algorithm on 
Textual Data Files 

Abstract: The paper presents a new algorithm, TEFIC, based on the Bayes 
formula, for the automatic detection and correction of errors in textual data. 
The algorithm is able to clean any textual file in a data base without a priori 
knowledge of the semantic interpretation of each word or of the linguistic rules 
characterizing a given language. The algorithm performs well on data bases 
including names and numbers. The kernel of the algorithm is the Bayes formula 
and the evaluation of the a priori probability, based on clusters of words. 
Clustering of words based on the Hamming distance becomes the key to 
successfully detecting and cleaning errors in a textual data file. 

Key Words 

Bayes, correction, detection, textual data. 

1. Introduction 

The aim of the paper is to present a new algorithm, TEFIC, able to clean the 
errors in a textual data file without any knowledge of the semantic interpretation 
of the words or of the rules of the language. The area of interest of the paper lies 
somewhere between classical statistics, informatics and linguistics. The main 
use of the TEFIC algorithm is the automatic cleaning of large textual data files 
containing thousands of records on the features of specific units, such as the 
workers of a given firm or the taxpayers of a given country. 

The main strategy for detecting and correcting errors in a large data base, with 
billions of textual data, is to exploit a thesaurus and to compare each textual 
datum with the most similar elements of the thesaurus. The similarity can be 
evaluated by considering either the simple sequences of elementary units, such 
as letters or numbers, or the meaning, and therefore the words with a similar 
meaning. Finally, another approach is to introduce linguistic rules, which are 
constraints on the sequences of elementary units such as letters, commas, points 
and numbers. All these approaches are based on some a priori knowledge that 
depends on the specific language. From our point of view, the effort of 
introducing this knowledge can be avoided, provided we can build an algorithm 
able to recognize the actual rules of a given language from a random sample of 
records, and to increase its knowledge as it runs on the data base, improving its 
performance on the job. The strategy can be compared with the usual techniques 
for signal filtering, where we deal with the symbols of the signals and the 
problem is to clean the random noise that modifies the symbols, Peirce (1961), 
Tong (1990). In the next section we describe the main features of the 
algorithm and give some information on its performance. 

2. The foundations of the TEFIC algorithm 

The TEFIC algorithm is based on the probability of occurrence of a given word 
and on the probability of occurrence of a given character in the text, such as a 
letter, a comma or any other character that can be identified. Moreover, we 
suppose that the source text is randomly modified, so that the actual text is 
obtained from a source text by a random disturbance. This point of view is quite 
similar to the usual assumptions on a randomly disturbed channel of a digital 
transmission. Some other assumptions can be stated as follows. 

H1. a finite list Θ of N words is considered, whose members are both the words 
before the disturbance and the words after the disturbance; 

H2. a word s_i can be a random mutation of one of a subset L_i of other words 
in Θ; 

H3. an equiprobability distribution is assigned to the list Θ; 

H4. the occurrence of a given word in Θ is statistically independent of the 
occurrence of another word. 

The consequence of the previous assumptions is that we can identify the list of 
words Θ and that their occurrence is subject to the sampling rules. Full 
knowledge of the list is approached as the size of the samples increases, 
following a learning process. 

Assumption H2 implies that we can assign to each element s_i of Θ a subset L_i 
of Θ. The subset L_i can be identified either by rules, or by a systematic 
substitution of each character with all the other characters of the alphabet, and 
finally by permutation of all the characters of the word. A further way is to 
delete a character or to add a new character. It is easy to see that the size of the 
subset L_i can be extremely large, and the final purpose is to identify the subset 
to which the right word belongs. As a final comment, the identification of a 
highly probable subset to attach to each word can be a matter of discussion, but 
its role can be appreciated in so far as the exploration of the full set Θ is 
practically impossible. As a consequence we have to build up a set of clusters 
which form a covering of the set Θ. 

Actually, each word is the kernel of a cluster, and a textual data file can be 
considered as a set of kernels. The set of all clusters related to the file is a 
population of overlapping clusters. 
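The idea of one cluster per kernel word can be sketched as follows. This is an illustrative assumption rather than the authors' procedure: clusters are restricted here to words of equal length within a Hamming radius of 1 (substitutions only, no deletions, insertions or permutations), and the names `hamming` and `build_clusters` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs words of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))

def build_clusters(words, radius=1):
    """Map each distinct word (the kernel) to the set of words near it."""
    vocabulary = sorted(set(words))
    clusters = {}
    for kernel in vocabulary:
        clusters[kernel] = {
            other
            for other in vocabulary
            if len(other) == len(kernel) and hamming(kernel, other) <= radius
        }
    return clusters

clusters = build_clusters(["rossi", "rosso", "russo", "bianchi"])
```

In this toy list the cluster of "rosso" contains both "rossi" and "russo", while those two words are not in each other's cluster, which shows how the clusters overlap without coinciding.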

The problem we deal with is to detect the wrong words and to correct them by a 
search strategy in the population of clusters. In the next section we introduce 
the search algorithm based on the Bayes formula and on the generation of 
clusters. 

3. The algorithm TEFIC 

In this paragraph we introduce the TEFIC algorithm which is able to detect the 
wrong words and to correct them. 

Let us introduce some notation in order to write down the TEFIC algorithm: 

$n(\theta_i) = n_i$ : the absolute frequency of the occurrences of $\theta_i$; 

$l(\theta_i) = l_i$ : the length of the word $\theta_i$; 

$\theta_j^i = \theta^i \to \theta_j$ : the word $\theta_i$ becomes the word $\theta_j$ belonging to the same cluster C; 

$H(\theta_i, \theta_j)$ : the Hamming distance between $\theta_i$ and $\theta_j$; 

$\theta_i = \theta^i$; 

$n_j^i$ : the frequency of mutations of the word $\theta_i$ into the word $\theta_j$. 

The Hamming distance is the number of positions at which the two words have 
different characters. We can now introduce the following probabilities. Let 

i. $P(\theta_i) = n_i / N$, where $N = \sum_i n_i$. (1) 

ii. $P(\theta_i \cap \theta_j^i) = \dfrac{n_i}{N}\, P(\theta^i \to \theta_j)$. (2) 

iii. $P(\theta_i \mid \theta_j^i) = \dfrac{P(\theta_i \cap \theta_j^i)}{P(\theta_j^i)} = P(\theta_j^i \to \theta_i)$. (3) 

iv. $P(\theta_j^i \mid \theta_i) = \dfrac{P(\theta^i \to \theta_j)\, n_j^i}{\sum_k P(\theta^i \to \theta_k)\, n_k^i}$. (4) 

The frequency $n_j^i$ is the number of occurrences of mutations of the word $\theta^i$ 
into the word $\theta_j$ belonging to the same cluster of words, C, whose cardinality 
is $n_i(C)$. 

The last equation is the main result of the paper, because by (4) it is possible to 
evaluate the probability of observing a mutation $\theta_j^i$ when the source word is 
$\theta^i$. This is the probability of an error, and therefore any word in the text can 
be a transformation of some other word. In other terms, the message can be 
different from the source. 

It is possible to set up some hypotheses on the probability of an event 
$\theta^i \to \theta_j$. The first hypothesis is: $P(\theta_j^i) = 1$, $\forall j \in C$. The 
consequence of the previous statement is the following: 

$$P(\theta_j^i \mid \theta_i) = \frac{n_j^i}{N} \qquad (5)$$

The probability that a word is the transformation of some other word is equal to 
the relative frequency of the word in the set of all the possible words. 

The previous hypothesis is the simplest one; it is based on the actual 
occurrences in a given text file and was effectively used to test the 
algorithm. In this case all the information comes from the data, and the 
estimation of the probability is reduced to the estimation of the frequency of 
the data. The empirical approach can be justified from the frequentist point of 
view and does not imply theoretical assumptions. The experimental results will 
be considered later on in the paper. 

Some other hypotheses can be introduced. The simplest one is equiprobability, 
and the final formula is the following one: 

$$P(\theta_j^i \mid \theta_i) = \frac{n_j^i}{\sum_k n_k^i} \qquad (6)$$

The previous assumption is the simplest one without any support from the data. 
It is easy to see that (6) is actually a probability, so that: 

$$0 \le P(\theta_j^i) \le 1, \qquad \sum_j P(\theta_j^i \mid \theta_i) = 1, \qquad \sum_j P(\theta^i \to \theta_j) = 1$$



A final hypothesis can be considered, based on the errors associated with the 
characters of the alphabet. In this case we are strictly constrained by the choice 
of the alphabet. Some assumptions should be introduced, like the following ones: 

i. the errors are transformations of a set of characters into a set of characters; 

ii. the characters should be identified in advance; 

iii. the probability space of the characters is equiprobable; 

iv. the probability of a transformation of a given character into another character 
is a known number. 

We can easily evaluate the following probability: 

$$P(\theta^i \to \theta_j) = 26^{-H(\theta_i, \theta_j)}\; p_c^{\,H(\theta_i, \theta_j)}\; (1 - p_c)^{\,l_i - H(\theta_i, \theta_j)} \qquad (7)$$

It is possible to recognize the binomial distribution, where the number 26 comes 
from the number of characters in the Latin alphabet and $p_c$ is the probability 
of a mutation of each character. 

We can easily see that all the previous considerations can be introduced for the 
estimation of the probability of each character. Moreover, we still use the 
Hamming distance for evaluating the previous probabilities. We can introduce 
some further assumptions on the occurrences of the words. In case we consider 
only words that have $H(\theta^i, \theta_j) \le 1$, we can rewrite the previous 
distribution as follows: 

i. If $\theta_j^i \ne \theta_i$ then $P(\theta^i \to \theta_j) = \dfrac{1}{26}\, p_c\, (1 - p_c)^{\,l_i - 1}$ (8) 

ii. If $\theta_j^i = \theta_i$ then $P(\theta^i \to \theta_j) = (1 - p_c)^{\,l_i}$ (9) 

The consequence of the previous formulas is that we can no longer consider the 
function $P(\theta_j^i)$ as a probability in spite of: 

i. $0 \le P(\theta_j^i) \le 1$ 

ii. $\sum_j P(\theta_j^i) = 1$ 

Finally we can write down the explicit formulas for the conditional probabilities: 

i. For $\theta_j^i \ne \theta_i$ we have: 

$$P(\theta_j^i \mid \theta_i) = \frac{p_c\, n_j^i}{26\,(1 - p_c)\, n_i + p_c \sum_k n_k^i} \qquad (10)$$

ii. For $\theta_j^i = \theta_i$: 

$$P(\theta_j^i \mid \theta_i) = \frac{26\,(1 - p_c)\, n_i}{26\,(1 - p_c)\, n_i + p_c \sum_k n_k^i} \qquad (11)$$

Whether the last probability is greater than the previous one depends on the 
ratio between the frequencies $n_j^i$ and $n_i$. The equality (10) = (11) 
corresponds to the relation: 

$$\frac{n_j^i}{n_i} = 26\left(\frac{1}{p_c} - 1\right) \qquad (12)$$

It is easy to see that the probability $p_c$ plays a basic role in the evaluation of 
the conditional probabilities of the words. 

The TEFIC algorithm can be reduced to the following steps: 

Step 1: read a text data file; 

Step 2: cluster the words by the Hamming distance; 

Step 3: evaluate the probability of the character and the conditional probability 
of the words; 

Step 4: detect the words with the highest probability and decide which words 
should be substituted; 

Step 5: find the most probable words able to substitute the words detected in 
the previous steps and stop the algorithm. 
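The steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: only words of equal length at Hamming distance 1 are treated as possible sources, the unobserved mutation frequencies are replaced by the observed word frequencies, the weights follow the form of (10) and (11), and the names `source_posterior` and `correct`, as well as the value 0.05 for the character error probability, are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(ca != cb for ca, cb in zip(a, b))

def source_posterior(word, counts, p_c=0.05, alphabet_size=26):
    """Posterior weights over candidate source words for an observed word."""
    neighbours = {
        other: n
        for other, n in counts.items()
        if len(other) == len(word) and hamming(other, word) == 1
    }
    # Weight of "the word is correct", as in the numerator of (11).
    stay = alphabet_size * (1.0 - p_c) * counts[word]
    denom = stay + p_c * sum(neighbours.values())
    posterior = {word: stay / denom}
    for other, n in neighbours.items():
        # Weight of "the word is a mutation of `other`", as in (10).
        posterior[other] = p_c * n / denom
    return posterior

def correct(words, p_c=0.05):
    """Replace each word by its most probable source word."""
    counts = Counter(words)                      # Steps 1 and 3: frequencies
    corrected = []
    for word in words:                           # Steps 2 to 5, word by word
        posterior = source_posterior(word, counts, p_c)
        corrected.append(max(posterior, key=posterior.get))
    return corrected
```

With these weights a word is replaced only when a neighbour is more frequent by a factor larger than 26(1/p_c - 1), which is the threshold expressed by relation (12); for p_c = 0.05 that factor is 494.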

4. Concluding remarks 

The paper presents a statistical approach for detecting and cleaning the errors in 
text data files. The main feature of the proposed algorithm is its ability to do 
without a thesaurus and to exploit the probabilities of occurrence of the possible 
errors through a suitable set of clusters of words, where the adopted metric is 
based on the Hamming distance between the words. The problem of errors in 
large data banks is well known, and a fast and inexpensive approach to 
detecting and cleaning errors is valuable because the usual approaches are 
very expensive. 

Abstract: Multidimensional Data Analysis (MDA), by means of statistical 
methods, is used in order to select the main sources of variability and to 
establish a hierarchy of the characteristics under study. This hierarchy is 
necessary in social medicine for decision making. 

Key words: Cluster, Correspondence Analysis, Social Medicine, Decision 



1. Introduction 

MDA has been applied in social medicine since 1985 to develop 
strategies for an optimal ratio between public health needs and medical staff 
availability. Through statistical methods, Multidimensional Data Analysis can 
solve the problem of ranking the characteristics under study in a hierarchy. 
This hierarchy is necessary in social medicine for decision making, taking the 
constraints into account and implying the priorities. 



2. Objectives 

The aim is a synthetic data representation with a minimum loss of information 
and a maximum relevance. A single synthetic indicator is theoretically 
unacceptable (Arrow's theorem). In order to combine the continuous methods 
with the discrete techniques (Naouri's theorem), scientists have to establish a 
hierarchy. The "without previous hypothesis" (Benzecri) methods, which do not 
impose a normal distribution on the data, may be mentioned. 



3. Methods 

The assessment and ranking of the factors and priorities in primary health 
care, including the number of physicians needed, are studied by means of 
correspondence analysis (J. P. Benzecri). The applications mentioned above 
were solved, first, with an adapted FORTRAN program (1985) and, later, with 
the SPSS (1991) and SYSTAT (1995) software. The distances or similarity 
indices were computed by means of EUCL (Euclidean distance), SUMSQ 
(squared Euclidean distance), CHISQ (chi-square index), CORR (dissimilarity 
index based on the negative correlation) and JACC (dissimilarity index based 
on the negative Jaccard index). Of these, the chi-square distance (SPSS) and 
the Euclidean distance (SYSTAT) were used. 
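As a rough illustration of the measures named above, the sketch below computes three of them between the rows of a frequency table. It makes no claim about how SPSS or SYSTAT define their measures: the chi-square distance used here is the one customary in correspondence analysis (distance between row profiles, weighted by the inverse column masses), which is an assumption, and the function name `row_distances` is hypothetical.

```python
import numpy as np

def row_distances(table):
    """Pairwise Euclidean, squared Euclidean and chi-square row distances."""
    N = np.asarray(table, dtype=float)

    # Euclidean and squared Euclidean distances on the raw frequencies.
    diff = N[:, None, :] - N[None, :, :]
    sumsq = (diff ** 2).sum(axis=2)
    eucl = np.sqrt(sumsq)

    # Chi-square distance between row profiles (correspondence analysis form).
    profiles = N / N.sum(axis=1, keepdims=True)
    col_mass = N.sum(axis=0) / N.sum()
    pdiff = profiles[:, None, :] - profiles[None, :, :]
    chisq = np.sqrt((pdiff ** 2 / col_mass).sum(axis=2))
    return eucl, sumsq, chisq

# Toy table: 3 districts (rows) by 2 specialities (columns).
eucl, sumsq, chisq = row_distances([[10, 20], [20, 40], [30, 10]])
```

In the toy table the first two rows have identical profiles, so their chi-square distance is zero even though their Euclidean distance is not; this is the sense in which the chi-square distance compares structures rather than sizes.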

As a discrete method, cluster analysis is applied. The “Cluster Analysis 
Program” by Tim E. Aldrich & J. Wanzer Drane, from South Carolina 
University & Oak Ridge National Laboratory, was used recently (1996), 
particularly the Ohno and Grimson methods, with a cartographic representation 
on the geographic map of Romania. The Ohno method compares rates within 
geographic cells to identify clusters of high-rate areas. The Ohno program uses 
the chi-square statistic because more than two rate categories are possible. The 
Grimson method uses only a high and a low rate group. It calculates an 
approximately normal (z) statistic to test for non-random mixing of high and 
low rate areas (counties). The Grimson program is the more precise of the two. 

Romania is divided into 41 districts ("judets"), numbered in alphabetical 
order. For example, the Alba (AB) district has the number 1, and the districts 
around it are 2, 5, 14, 23, 28, 34 and 40. A geographic referencing file, named 
Rocounty.geo, is created for all districts. This geographic configuration is 
loaded into the cluster program and then specific data can be entered. Each 
file can be edited later at will. The districts are referred to by their name 
abbreviations (e.g. Alba - AB, Bucharest - B, and so on). 



4. Data

Frequency contingency tables (absolute or relative data, i.e. physicians per
10 000 inhabitants, since 1985) were used for selecting the main objects (41
districts) and/or characteristics (15 medical specialities).



5. Results

The two main factorial axes (F1 and F2) were determined by means of the
correspondence method (FORTRAN program): F1 accounts for 65.35% of the
total variability and F2 for 21.65%, giving an 87.00% “quality” of
representation. The chosen model made a dual analysis possible, so the
objects (districts) and the characteristics (medical specialities) are both
represented with respect to the same factorial axes (fig. 1).

B, TM, MS, CJ, DJ and IS, the districts with the main academic centers, are
important, being positively correlated with the first factorial axis (F1). The
other districts lie in opposition to those mentioned above. The center of the
factorial axes is occupied by the PH district.

Fig. 1: Dual representation of districts/medical specialities
(specialists/10 000 inhabitants)




Less variability is recorded with respect to the second factorial axis (F2),
which opposes the GL and BR districts to the CV district, and so on.

The second representation of the dual analysis shows the tight correlation of
the medical specialities (physicians per 10 000 inhabitants) with respect to
the first factorial axis (F1): pathological anatomy (1) with laboratory (8),
epidemiology (3) and radiology (4) are placed in the center of the factorial
axes. Other medical specialities, such as internal medicine (6) and psychiatry
(13), are in shorter supply in certain districts and are opposed to those
mentioned above.

The SPSS program gives a similar representation for the main objects and/or
characteristics (fig. 2).

A SYSTAT program (Euclidean distance) was used to identify the main
similarities among districts in the number of physicians by medical
speciality. At the first level of the diagram, with the caveat that the data
were not standardized, the following classification of districts can be seen
(fig. 3):

((B, (CJ, PH)); (TM, IS, MS); (CS, HD, (BH, VL, CV)); 

(IL, (DB, SV, BT), (GJ, SJ), OT, SM); (IL, OT, SM); 

((BZ, CL), (TL, VS), (BC, GR), GL, VR, TR, BS); 

(DJ, (AG, BR)), CT); (AR, BV, SB, HG, NT); ((AB, MM), MH), and so on.

Fig. 2: Similar representation of districts/medical specialities

Fig. 3: Classification of districts/medical specialities




Cluster analysis is also motivated by the need to rule out a random
aggregation of the data. In practice, a public health policy can be based,
for example, on the geographic distribution of objects and/or
characteristics.

A geographic representation of some of the data was created to be read by a
spatial cluster program. A classification of the main areas of interest in
decision making became possible, with a cartographic representation on the
geographic map of Romania, e.g. for the pathology specialists (fig. 4).

The following 12 significant clusters are noticeable (p = 0.10):

(B PH) (CJ BH) (CJ MS) (IS BT) 

(IS NT) (MS HG) (MS SB) (TM CS) 

(NT HG) (SB AG) (CT BR) (GL BR) 

Fig. 4: Spatial classification of districts ranked by (1) pathology
specialists (1/10 000 inhabitants), p = 0.10




On the map, distinct clusters can be seen in the western districts TM and CS,
in the eastern districts GL, BR and CT, and in the districts B and IL. The
other marked districts are located in the north-central part of the country.



6. Conclusions 

The development of Multidimensional Data Analysis (MDA) can improve
decision making in the field of social medicine research.

The districts identified by MDA represent priority areas for public health
decision making.



Classification and Decision Making
in Forensic Sciences: 
the Speaker Identification Problem 

Carla Rossi 

Dipartimento di Matematica 
University of Rome Tor Vergata, Italy 

Abstract: The decision process in forensic sciences is quite a complex
phenomenon, conditioned by incomplete information and wide uncertainty. 
Setting up a proper model implies modelling an inductive process based upon 
inferential steps controlled by the evaluation of a set of conditional probabilities 
and likelihood ratios. 

The model proposed in the present contribution is based on a Bayesian
inferential paradigm and is outlined using a real example concerning the
identification of a defendant by means of vocal ‘prints’, in more technical
terms: ‘the speaker identification problem’.

A comparison with other methods based on the classical statistical approach
brings out the main features of the present model in computing the proper
conditional probabilities of the scientific evidence under the two
alternative and mutually exclusive hypotheses: the prosecution hypothesis and
the defence hypothesis.

The likelihood estimation problem is also considered. 

Key Words: speaker identification, classification errors, likelihoods, kernel
density estimation, Bayes factor.



1. Introduction 

It is clear from everyday experience that the speech signal carries information 
about its producer. There are specific circumstances in which it is crucial to be 
able to determine the identity of a speaker from speech alone. For instance, a 
witness to a crime may have heard the masked perpetrators speaking, or tape 
recordings of telephone conversations may have to be compared with the voice 
of a criminal suspect. 

Speaker recognition might be defined as any activity whereby a speech sample 
is attributed to a person on the basis of its phonetic, acoustic or perceptual 
properties. A distinction can be drawn between naive speaker recognition and 
technical speaker recognition. In the first case the recognition is performed by 
untrained observers. The decision is based on what is heard and no special 
techniques are involved. In the second case, much work on the comparison of
recordings is required, and establishing identity involves acoustic analysis
and probabilistic/statistical methodologies. Speaker recognition also falls
into two categories: speaker identification and speaker verification. In the
present paper we consider only the speaker identification problem, which
includes the usual forensic situation.

The forensic task is generally to compare two samples, one from a known
speaker and one from an unknown source, and ultimately a criterion has to be
applied to reach a decision as to whether the samples are similar enough to be
from the same speaker.

The speech of an individual is not a constant, either in terms of those
properties which result from the physical mechanism of speech or those which
are a function of the linguistic system. If we imagine the many parameters
which characterize a speaker as defining a location in a multidimensional
space, we see that it is not a single point which characterizes the speaker
but an area of variability, i.e. a probability distribution.

There exists an extensive literature dedicated to the identification problem.
In the 1970s, spectrographic analysis of voice signals was prevalent; in
the last few years, statistical analysis of properly defined acoustic
parameters has been developed, so as to take stochastic aspects into account.
Almost all of the statistical methodology used is of a parametric type as far
as the estimation of the likelihoods is concerned, and it is based on
classical statistical hypothesis testing as far as the classification problem
is concerned. In the following we analyse a non-parametric estimation
procedure that avoids imposing strong hypotheses that have already been shown
to be untenable, such as multinormality of the acoustic measures studied,
i.e. the formants of the vowels (Paronetto, 1995; Beciani, 1998).

For further details on acoustic analysis and data one can see the specific 
references reported below. 



2. The statistical-decision problem 

Technical speaker identification is usually based on the analysis of the
frequency distributions of the formants of the vowels. By formants we mean
the acoustic frequencies corresponding to the local intensity maxima of the
Fourier transform of the sound wave; depending on the position of these
maxima we can identify one vowel or another. Because of the effect of
anatomical and socio-linguistic characteristics, the formants of a speaker
have a particular probability distribution in a multidimensional space
‘characterising’ that speaker.
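
As a rough illustration of the definition of formants as local maxima of the spectrum, the following sketch picks the peaks of the magnitude spectrum of a synthetic signal. It assumes a toy sum of three sinusoids (700, 1200 and 2600 Hz, values chosen only for illustration), a naive discrete Fourier transform and a hypothetical relative threshold; none of the names or numbers come from the paper, and real formant extraction from speech generally needs a smoothed spectral envelope rather than raw peak picking.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Naive discrete Fourier transform; magnitudes for bins 0 .. N/2."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        mags.append(abs(acc))
    return mags


def spectral_peaks(signal, sample_rate, rel_threshold=0.1):
    """Frequencies (Hz) of local maxima above rel_threshold * global maximum."""
    mags = magnitude_spectrum(signal)
    floor = rel_threshold * max(mags)
    n = len(signal)
    peaks = []
    for k in range(1, len(mags) - 1):
        if mags[k] > floor and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
            peaks.append(k * sample_rate / n)
    return peaks


# Toy "vowel": three sinusoids standing in for three formants (illustrative only).
SAMPLE_RATE = 8000
N_SAMPLES = 400  # frequency resolution: 8000 / 400 = 20 Hz per bin
toy_signal = [
    1.0 * math.sin(2 * math.pi * 700 * t / SAMPLE_RATE)
    + 0.6 * math.sin(2 * math.pi * 1200 * t / SAMPLE_RATE)
    + 0.3 * math.sin(2 * math.pi * 2600 * t / SAMPLE_RATE)
    for t in range(N_SAMPLES)
]
peaks = spectral_peaks(toy_signal, SAMPLE_RATE)
print(peaks)
```

Repeating such a measurement over many replications of a vowel would give the frequency distribution of each formant that the text refers to.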

One of the fundamental issues in speaker identification is whether each
individual in the population corresponds to a different and identifiable
formant distribution or whether there is wide overlap.

It is well known that, unfortunately, the latter is the case; hence the
crucial problem of dealing with the classification errors, which arise
whatever decision procedure is adopted.

Furthermore, we must take into account that the identification process,
comprising the estimation of the individual formant distribution, has to be
based on a usually small sample of measured formants, thus implying wide
uncertainty. These facts place a limitation, in principle, on the reliability
of any act of speaker identification, a limitation which must always be
acknowledged. Indeed, ideally, the acoustic parameters extracted should
exhibit a large between-speaker variability and a low within-speaker
variability, but for the actual data this is not the case.

2.1. The likelihood estimation problem

The data on which the actual analysis is based relate to 100 different
recordings, each coming from a different adult male Italian speaker. For each
individual, 4 formants are measured for four different vowels: "a", "e", "i",
"o". The "u" has been excluded due to its low frequency in the Italian
language. Each recording comprises a different number of replications for each
vowel. So, each speaker is represented by a frequency distribution in a
16-dimensional space that can be modelled by a proper probability
distribution.

Such a distribution has to be estimated before dealing with the
classification problem. An extensive description of the data set can be found
in Calvani (1996).

As a matter of fact, the 16 parameters turn out to be highly correlated, so a
reduction of the complexity of the problem is possible by means of
multivariate data analysis methods such as principal component analysis or
factor analysis (Beciani, 1998).

Suppose that the independent factors can be represented in a k-dimensional
space (k < 16). To determine the proper probability distributions for the
classification problem, different approaches are possible. We can estimate
the k-dimensional distributions and then decide what kind of summary statistic
can be used to make decisions about the speaker identification problem, or we
can establish a proper vector dissimilarity index to measure the ‘distance’
between the characteristic measures coming from the recording of the
defendant's voice and the same measures coming from the recording of the
unknown speaker's voice, and then estimate the probability distributions of
this statistic under the two alternative hypotheses, that is, the prosecution
and the defence hypotheses.

For the present application we use the second approach and, after defining a
proper vector (or scalar) distance between the observations (or the principal
components), we estimate the likelihoods of interest, i.e. the conditional
probabilities of the evidence under the two alternative hypotheses, which in
the present application are related to the within-speaker and between-speaker
distributions, using the kernel estimation method. Cross-validation based on
the pseudo maximum likelihood method is used to optimize the smoothing
parameter.
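
The kernel estimation step with a cross-validated smoothing parameter can be sketched as follows. This is a minimal illustration, not the author's procedure: it assumes a Gaussian kernel, reads the pseudo-likelihood criterion as the usual leave-one-out form, searches a hypothetical grid of bandwidths, and uses synthetic one-dimensional "within" and "between" differences in place of the real data. All function and variable names are illustrative.

```python
import math
import random


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h):
    """Kernel density estimate at x with smoothing parameter (bandwidth) h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)


def loo_log_pseudolikelihood(sample, h):
    """Sum over i of log f_{-i}(x_i): each point is scored by the estimate
    built from all the other points (leave-one-out)."""
    n = len(sample)
    total = 0.0
    for i, xi in enumerate(sample):
        others = sum(gaussian_kernel((xi - xj) / h)
                     for j, xj in enumerate(sample) if j != i)
        density = others / ((n - 1) * h)
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total


def select_bandwidth(sample, candidates):
    """Bandwidth maximizing the leave-one-out pseudo-likelihood."""
    return max(candidates, key=lambda h: loo_log_pseudolikelihood(sample, h))


# Synthetic stand-ins for one coordinate of the differences (assumption).
rng = random.Random(0)
within = [rng.gauss(0.0, 0.5) for _ in range(100)]   # same speaker
between = [rng.gauss(0.0, 2.0) for _ in range(100)]  # different speakers
grid = [0.1 * k for k in range(1, 21)]

h_w = select_bandwidth(within, grid)
h_b = select_bandwidth(between, grid)

# Likelihood ratio of an observed difference of 0 under the two hypotheses.
lr_at_0 = kde(0.0, within, h_w) / kde(0.0, between, h_b)
print(h_w, h_b, lr_at_0)
```

The ratio of the two estimated densities at an observed difference is the quantity on which a likelihood-ratio classification rule would be based.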

The estimated likelihoods are used to set up the proper classification algorithm 
based on the likelihood ratio. 

2.2. The classification problem 

In the forensic arena the commonest speaker recognition task is to decide how
likely it is that two recordings are from the same person. This implies that
two kinds of errors are possible. An innocent individual can be convicted, as
a consequence of the similarity between the two recordings under
consideration, or a guilty person can be released due to their dissimilarity.

The most important information for the decision makers is thus the measures
of the risks associated with the two kinds of errors.

There are judicial cases in Italy where many controversies arose among the
persons involved in the decision process about the proper approach to the
decision problem. In particular, some cases were dealt with by means of
significance tests and were highly criticized, because the probability of
false attribution was not taken into account. Some discussions derived from
these cases are reported in Rossi (1996a and 1996b).

In the present contribution, the decision problem is approached through a
Bayesian decision paradigm, based on the likelihood ratio or Bayes factor of
the two hypotheses. The sensitivity analysis with respect to the prior
probabilities shows that, for a wide range of values, the maximum likelihood
approach is feasible.

The probability of false attribution is easily calculated for different
values of the prior probabilities, thus providing a complete decision support
system for decision makers.
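
A sensitivity analysis of this kind can be sketched with Bayes' theorem in odds form. The sketch assumes that the "probability of false attribution" is read as the posterior probability that the defendant is not the speaker when the recording is attributed to him; the Bayes factor of 10 and the grid of priors are hypothetical values, and the function names are illustrative.

```python
def posterior_same_speaker(bayes_factor, prior_same):
    """Posterior probability of the prosecution hypothesis:
    posterior odds = Bayes factor * prior odds."""
    if not 0.0 < prior_same < 1.0:
        raise ValueError("prior_same must lie strictly between 0 and 1")
    posterior_odds = bayes_factor * prior_same / (1.0 - prior_same)
    return posterior_odds / (1.0 + posterior_odds)


def false_attribution_probability(bayes_factor, prior_same):
    """Posterior probability that the defendant is NOT the speaker, i.e. the
    risk run if the recording is attributed to the defendant."""
    return 1.0 - posterior_same_speaker(bayes_factor, prior_same)


# One hypothetical Bayes factor evaluated over several prior probabilities.
bayes_factor = 10.0
priors = (0.01, 0.1, 0.25, 0.5, 0.75)
table = {p: false_attribution_probability(bayes_factor, p) for p in priors}
for prior, risk in table.items():
    print(f"prior = {prior:.2f}  ->  P(false attribution) = {risk:.3f}")
```

Tabulating the risk over a range of priors in this way lets a decision maker see how strongly the conclusion depends on the prior.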

In any case, the evaluation of the posterior probabilities of the two errors
is such that any kind of decision seems to be quite critical, due to the
uncertainty coming from the incomplete information about the between-speaker
distribution, as its estimation is actually based on just small samples of
voices coming from the reference population.



3. An application to a real data set 

To go into greater detail, we use the same data set studied by
Calvani (1996).

3.1. Exploratory analysis

One of the first issues to be studied is the ‘dimension’ of the
speaker space generated by the 16 formants. A preliminary exploratory analysis
of the data on the 100 adult male speakers described in Calvani (1996) has
been conducted by means of principal component analysis using S-plus.

The correlation matrix of the 16 averaged formants has been calculated, showing 
some rather high correlations. 

In particular, we observe correlation coefficients ranging from 0.86 to 0.91
between the type formants of the four vowels, as was expected from
acoustic models, but also correlation coefficients of 0.86 between the f3
formants of “a” and “e”, 0.88 between “a” and “o”, 0.85 between “e” and “i”
and 0.82 between “e” and “o”. The correlation coefficient between the f2
formants of “e” and “i” is 0.81, to mention only correlations above 0.80.

The principal components analysis has given the following
eigenvalues: 5.77 (36%), 3.84 (24%), 1.84 (11.5%), 1.05 (6.6%), 0.93 (5.8%),
0.74 (4.6%), 0.39 (2.4%), 0.34 (2.1%), 0.32 (2%), 0.2 (1.3%), 0.15 (0.9%),
0.12 (0.8%), 0.1 (0.6%), 0.09 (0.6%), 0.08 (0.5%), 0.04 (0.3%). We can
observe that the variability related to the first 2 components is 60%; with 7
components we get more than 90%, using 9 components we obtain 95%, and we
need 14 components to reach 99%.
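
The percentages quoted above can be checked directly from the eigenvalues, which sum to 16 as expected for the correlation matrix of 16 variables. The short sketch below recomputes the cumulative shares; the helper names are illustrative.

```python
EIGENVALUES = [5.77, 3.84, 1.84, 1.05, 0.93, 0.74, 0.39, 0.34,
               0.32, 0.20, 0.15, 0.12, 0.10, 0.09, 0.08, 0.04]


def cumulative_shares(eigenvalues):
    """Cumulative proportion of total variability explained by the first k components."""
    total = sum(eigenvalues)
    shares, running = [], 0.0
    for value in eigenvalues:
        running += value
        shares.append(running / total)
    return shares


def components_needed(eigenvalues, target):
    """Smallest number of leading components whose cumulative share reaches target."""
    for count, share in enumerate(cumulative_shares(eigenvalues), start=1):
        if share >= target:
            return count
    return len(eigenvalues)


shares = cumulative_shares(EIGENVALUES)
for target in (0.90, 0.95, 0.99):
    print(target, components_needed(EIGENVALUES, target))
```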

The first eigenvector is essentially a weighted mean of the 16 formants, the
second is a contrast between the formants (f0, f1) and (f2, f3), the third is
a more complicated contrast and the fourth is a contrast between the formants
(f0, f2) and (f1, f3); the first 4 components account for 78% of the total
variability.

The scatter plots on the eigenvector planes show sparse (independent) 
configurations (figure 1). 

This suggests using the principal components as independent coordinates of the
speaker in the speaker space. In this framework, the easiest scalar distance
index is the Euclidean distance between the unknown speaker’s coordinates and
the defendant’s coordinates, or any other index based on the differences of
the corresponding components Zi (i = 1, 2, ..., 16) within and between
speakers. Then the proper likelihoods can be estimated. As a first step it is
thus necessary to study and compare the distributions of these 16+16
differences.

3.2. The estimation of the two likelihoods 

Unfortunately, as shown in Paronetto (1995) and confirmed in Beciani (1998)
by new exploratory analyses using Bhattacharya’s graphical method (1967), it
is not possible to suggest a simple parametric family to model the
variability of the formants, whether based on Gaussians or on Gaussian
mixtures, nor, as a consequence, to model the variability of the principal
components. This means that, as already suggested in Calvani, it is necessary
to proceed using non-parametric density estimation methods, such as the
kernel method.

First of all, a sample of within-speaker and between-speaker differences is
needed. Then the kernel density estimation method can be used, as already
done in Calvani for the individual original formants.

The sample of the between-speaker indices has been generated by sampling 400
pairs of different speakers and then calculating the 16 differences between
the Zi. The sample of the within-speaker differences has been generated by
resampling 400 formant vectors from the data of speaker “66”, transforming
them into principal components using the matrix obtained by the principal
components analysis of the 100 speakers, and then calculating the 16
differences between the Zi.

Then the likelihoods have been estimated by the kernel method, optimizing the
smoothing parameter by pseudo-likelihood cross-validation applied to each
coordinate separately. Some results are shown in figure 2 and indicate some
interesting features. In particular, we can observe that the differences of
the corresponding principal components under the within (prosecution
hypothesis) and between (defence hypothesis) distributions show opposite
behaviours with respect to their variability. Indeed, the between indices
have much higher variability for the first components and lower for the last,
while the within indices have lower variability for the first components and
higher for the last. This suggests using the coordinates separately, without
combining them into a scalar index, to obtain the maximum efficiency for
classification purposes. It must be noted that the overlap of the
distributions of the intermediate indices mirrors the “true dimensionality”
of the formant space, already analysed by calculating the correlation
coefficients. Higher-order differences could accentuate the different
behaviour of the within and between distributions and should be studied in
more depth.



3.3. The decision problem

Once the likelihoods of appropriate indices, corresponding to the two
alternative hypotheses, have been estimated, the decision (classification)
problem can be easily dealt with by means of Bayes factors, as also suggested
in Aitken (1995). Thus, in real cases, using the proper indices and
likelihoods, Bayes factors can be calculated for various speakers and several
values of the distance indices based on the differences of the principal
components. Different tables can then be made available and used as a
valuable basis for evaluating posterior probabilities and classification
errors in forensic cases, once the proper within-speaker likelihood has been
calculated for the actual defendant. The complete tables and results for the
actual case (defendant assumed to be speaker “66”) will appear in a future
paper (Rossi, in preparation). We must stress that the present approach is
the only one that allows the classification errors and the probative value
of any evidence to be properly calculated. Again, we must point out that the
approach to speaker identification through tests of significance, used with
the aim of accepting the null hypothesis rather than rejecting it, as in Ibba
and Paoloni (1993) and as proposed by the same authors in some actual
forensic cases, is unsound: it does not allow the identification error
probability to be calculated, and may more easily lead to the conviction of
an innocent person.



4. Conclusions and further developments 

Advances in technology will continue to bring gradual progress in the area of
technical speaker identification in the forensic context, but the advances
will depend heavily on a more complete model of ‘speaker space’ becoming
available. That is, if we assume a k-dimensional space defined by acoustic
and phonetic parameters which can be analysed, we need to study further which
statistics are relevant for separating speakers and, with respect to the
reference population, to what extent the region of the space occupied by one
speaker overlaps the regions occupied by other speakers. The research will
have to take advantage of databases containing samples from larger numbers of
speakers. Some data coming from the Italian Scientific Police laboratory are
presently available for study, and the results of the statistical analyses
are expected in a few months. Another important aspect concerns the choice of
efficient dissimilarity indices.

An extensive study of the specific discriminant features of different indices
is presently under way using the two different data sets: the first is the
data set described in Calvani (1996) and the second is the data set coming
from the Italian Scientific Police laboratory, which was extracted using a
different method, so that peculiarities related to the measurement method can
also be studied.

All these analyses are designed with the aim of reducing the probabilities of
the classification errors that may result either in the conviction of an
innocent person or in the release of a guilty one. Thus, the final result of
the studies should be a reduction of the expected social costs deriving from
the forensic cases connected to the speaker identification problem.

Figure 1: Scatter plots of some pairs of principal components




Figure 2: Estimated densities of within and between differences of some
principal components (w = within, b = between)





Ordinal Variables in the Segmentation 
of Advertisement Receivers

Abstract: The paper presents a segmentation study which employs
classification methods to single out the segments. Variables measured on an
ordinal scale were used as the criteria of market segmentation. The variables
reflected students’ attitudes towards 20 statements about advertising. The
ordinal character of the data required the application of a specific measure
(1) of distance between objects. This measure, based on the numbers of
relations “equal to”, “greater than” and “smaller than”, was used to evaluate
the similarities of objects.

Keywords: Classification, Measurement Scales, Market Segmentation. 



1. Market segmentation 

The pioneering article of Smith [1956] made segmentation a widely researched
and applied approach to marketing. Almost every enterprise observes
heterogeneity in its market, which is why the division of consumers into
homogeneous sub-groups seems to be a promising approach. As a result, the
enterprise may expect an increase in the profitability of its market
operations. A theoretical justification of this expectation was published in
the work of Frank, Massey and Wind [1972]. They proved that the expected
value of the profitability level obtained is higher with a diversified
approach to the market. This led to the quick acceptance of segmentation by
both managers and marketing research specialists.

Marketing strategies are built on the results of the segmentation undertaken.
Similar tendencies are observed in Poland. Multinationals which operate on
the Polish market appreciate the role of diversifying marketing strategies
according to identified market segments more than other firms do.

There are two most frequently employed approaches to segmentation in
practical market research (see Green [1977]):

- segmentation a priori,

- segmentation post hoc based on classification procedures.

The post hoc model of segmentation, in which classification methods are used,
differs from the a priori model only in the way the base variable for
segmentation is chosen. Instead of an arbitrary choice of segmentation base
(as in the a priori approach), a variety of consumer characteristics are used
in the post hoc approach. In the next step, a classification procedure is
used in which neither the number of classes nor their descriptive
characteristics are known in advance. The variables most frequently used in
segmentation studies are those identifying the needs, attitudes, lifestyle
and psychographic characteristics of the consumers. As a result, homogeneous
consumer groups are isolated and identified. Factor analysis, which aims at
reducing the set (number) of diagnostic features, often precedes the class
isolation process.

The authors of the review works Frank and Green [1968], Saunders
[1980], Punj and Stewart [1983], Beane and Ennis [1987] and Arabie and
Hubert [1996] pointed out the main fields of application of classification
methods in marketing, mainly in market segmentation.

Empirical segmentation studies, which have been dominated by the two
approaches mentioned above (a priori and post hoc), are supplemented by fuzzy
and stochastic classification, regression analysis, and conjoint analysis.

The application of classification methods requires a formalisation of the
term “similarity of objects”. The use of a particular construction of the
similarity measure depends on the scale on which the variables were measured.
The choice of similarity measure is easy when all the variables describing
the examined objects are measured on one type of scale. The literature
presents plenty of different ways of measuring similarity which can be
applied to variables measured on the following scales: ratio, interval and
(or) ratio, nominal (including binary variables). A wide range of similarity
measures has been shown in the works of Cormack [1971], Anderberg [1973],
Everitt [1974], and Kaufman and Rousseeuw [1990].

Walesiak [1993], p. 44-45, proposes a new measure of object similarity which
can be applied when the variables describing the objects are measured only on
the ordinal scale. The construction of the measure under consideration uses
Kendall’s idea of the correlation coefficient τ for ordinal variables (see
Kendall [1955], p. 19). Let A be a non-empty set of objects, each described
by m ordinal variables. Because the only possible empirical operation on the
ordinal scale is the counting of events (that is, counting the numbers of
relations “equal to”, “greater than” and “smaller than”), the construction of
a distance measure in the following
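form is being proposed:

where:

$$
a_{ipj}\ (b_{krj}) =
\begin{cases}
1, & \text{if } x_{ij} > x_{pj}\ (x_{kj} > x_{rj}), \\
0, & \text{if } x_{ij} = x_{pj}\ (x_{kj} = x_{rj}), \\
-1, & \text{if } x_{ij} < x_{pj}\ (x_{kj} < x_{rj}),
\end{cases}
\qquad (2)
$$

p = k, l; r = i, l; i, k, l = 1, ..., n - number of the object,
j = 1, ..., m - number of the ordinal variable,

$x_{ij}$ ($x_{kj}$, $x_{lj}$) - i-th (k-th, l-th) observation on the j-th ordinal variable,

$\sum_{j=1}^{m} a_{ikj}^{2} + \sum_{j=1}^{m} \sum_{l=1,\, l \neq i,k}^{n} a_{ilj}^{2}$ - number of relations “greater than” and
“smaller than” defined for object i,

$\sum_{j=1}^{m} b_{kij}^{2} + \sum_{j=1}^{m} \sum_{l=1,\, l \neq i,k}^{n} b_{klj}^{2}$ - number of relations “greater than” and
“smaller than” defined for object k.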

The distance measure $d_{ik}$ satisfies the conditions $d_{ik} \geq 0$,
$d_{ii} = 0$, $d_{ik} = d_{ki}$ (for all i, k = 1, ..., n) and does not
always satisfy the triangle inequality (based on simulation analysis).
Transformation of the ordinal data by any strictly increasing function does
not change the value of distance (1).

Distance (1) assumes values in the interval [0; 1]. The value 0 indicates
that, for the compared objects i, k, only relations “equal to” hold between
the corresponding observations on the ordinal variables. The value 1
indicates that, for the compared objects i, k, relations “greater than” hold
between the corresponding observations on the ordinal variables, or relations
“greater than” together with relations “equal to”, provided these also hold
with respect to the other objects (i.e. objects numbered l = 1, ..., n, where
l ≠ i, k).
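
The properties just listed can be explored with a small sketch. The displayed formula (1) is not legible in this copy, so the code assumes the following form, which is consistent with definition (2) and with the two relation counts given above: d_ik = 1/2 - (Σ_j a_ikj b_kij + Σ_j Σ_{l≠i,k} a_ilj b_klj) / (2 sqrt(A_i B_k)), where A_i and B_k are the numbers of “greater than” and “smaller than” relations for objects i and k. This form, the function names and the toy data are assumptions of the sketch, as is the convention of returning 0 when an object is tied with every other object.

```python
import math


def relation(x, y):
    """Return 1, 0 or -1 for 'greater than', 'equal to', 'smaller than'."""
    return (x > y) - (x < y)


def ordinal_distance(data, i, k):
    """Distance between objects i and k (rows of data) under the assumed form
    d_ik = 1/2 - cross / (2 * sqrt(A_i * B_k))."""
    n, m = len(data), len(data[0])
    cross = 0  # sum of products a * b
    a_sq = 0   # A_i: 'greater'/'smaller' relations counted for object i
    b_sq = 0   # B_k: 'greater'/'smaller' relations counted for object k
    for j in range(m):
        a_ik = relation(data[i][j], data[k][j])
        b_ki = relation(data[k][j], data[i][j])
        cross += a_ik * b_ki
        a_sq += a_ik * a_ik
        b_sq += b_ki * b_ki
        for l in range(n):
            if l == i or l == k:
                continue
            a_il = relation(data[i][j], data[l][j])
            b_kl = relation(data[k][j], data[l][j])
            cross += a_il * b_kl
            a_sq += a_il * a_il
            b_sq += b_kl * b_kl
    if a_sq == 0 or b_sq == 0:
        # Object tied with every other object: the ratio is 0/0.
        # Returning 0 is a convention of this sketch.
        return 0.0
    return 0.5 - cross / (2.0 * math.sqrt(a_sq * b_sq))


# Toy answers of four respondents to three statements on a 1-5 scale.
answers = [[5, 4, 3], [5, 4, 3], [1, 2, 3], [3, 3, 3]]
print(ordinal_distance(answers, 0, 1))  # identical answers
print(ordinal_distance(answers, 0, 2))  # opposed answers
```

Because only the relations between observations enter the computation, recoding the scale by any strictly increasing function leaves the result unchanged.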

A discussion of the choice of hierarchical classification methods is given
in the work of Gordon [1987]. In marketing and other behavioural sciences,
the most popular hierarchical agglomerative method is complete linkage (see
Arabie and Hubert [1994], p. 171). The method of complete linkage has been
adopted for the segmentation. It is one of the hierarchical agglomerative
methods which do not show a tendency to create chain-like classes. Complete
linkage thus avoids the disadvantage of other hierarchical methods, which
tend to incorporate new objects into the existing groups: with such methods
new classes are rarely established and, as a result, dissimilar objects may
end up in one class. The distance given by formula (1) has been adopted as
the similarity measure.



2. Segmentation of advertising responders - case study

A group of 244 students of the Wroclaw University of Economics was surveyed
in order to isolate the groups (segments) which reacted similarly to certain
statements about advertisements. The experiment was designed with the use of
classification methods (a post hoc approach) in mind. The questionnaires
were completed in the 1996/1997 academic year. Segmentation was based on 20
statements about advertising:

1. Information coming from the advertisement helps me to make a decision.

2. I don’t believe an advertisement which emphasises higher product
quality than that of its competitors.

3. I usually decide to buy an advertised product which is accompanied by
additional bonuses.

4. Without advertisements no product can be accepted by a wide range of
consumers.

5. In-store promotion has limited influence on my purchasing decision.

6. Advertising is a good source of information.

7. Advertising tempts people to spend their money foolishly.

8. If advertising were stopped, the customer would be better off.

9. Generally advertisements annoy me.

10. The most effective are TV commercials.

11. On the whole, advertising is believable.

12. I enjoy television commercials.

13. In my opinion advertisements strengthen the purchasing decisions of
elderly people.

14. Information about advertised products saves my precious time.

15. I usually decide to buy a newly launched and advertised product.

16. I think that legal regulations are necessary to limit the influence of
advertisements on consumers.

17. My brand loyalty could not be changed even by an exceptionally good
advertisement of competitive brands.

18. Most advertisements are based on the credulity of consumers.

19. Before purchasing a newly advertised product I usually seek my
friends’ advice.

20. Advertisement is a sign for me that the product should not be purchased.

The students surveyed were asked to mark the category on a five-point
ordinal scale which reflected their attitude towards the given statement: 5 -
Strongly agree; 4 - Agree; 3 - Neither agree nor disagree; 2 - Disagree; 1 -
Strongly disagree.

For the purpose of this study, hierarchical agglomerative methods were used to 
classify students into relatively homogeneous clusters. The complete linkage 
method proved to be the most suitable because it avoids the chaining effect. 
Distance (1) was adopted as the measure of dissimilarity between objects. The 
optimal number of segments was expected to lie between 5 and 10. Because the 
last class has too few objects to be considered a segment, seven segments were 
finally established. 
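
The clustering step described above can be sketched as follows. This is a minimal illustration of complete-linkage agglomerative clustering, assuming a symmetric distance matrix is already available (for example, one computed with distance (1)). The function name `complete_linkage` and the toy data are assumptions made for illustration; this is not the program used in the study.

```python
def complete_linkage(dist, n_clusters):
    """Merge clusters agglomeratively until n_clusters (>= 1) remain.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Under complete linkage, the distance between two clusters is the
    largest distance between any pair of their members.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of first cluster, index of second)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy example (hypothetical data): five objects placed on a line.
points = [0, 1, 5, 6, 20]
toy = [[abs(p - q) for q in points] for p in points]
print(complete_linkage(toy, 3))  # [[0, 1], [2, 3], [4]]
```

Because the cluster-to-cluster distance is the maximum over member pairs, a single close pair cannot pull two otherwise distant groups together, which is why the method does not produce chains.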

In a post hoc segmentation, respondents’ attitudes measured on the Likert scale 
are very often used as the segmentation base. To establish segment profiles, 
additional respondent characteristics are analysed (see Green, Tull and Albaum 
[1988], pp. 691-693). The chosen demographic, geographic and economic 
characteristics are presented in Table 1. 



Table 1: Characteristics of market segments 

Variable                                 Segment 
                                         I      II     III    IV     V      VI     VII 
Number of women / men                    ?      ?      22/11  9/14   ?      ?      13/12 
Average number of household members      3.82   3.67   3.85   3.70   ?      ?      3.60 
[row label illegible]                    1.67   1.51   1.87   1.58   ?      ?      1.67 
[row label illegible]                    ?      4.04   4.32   3.02   3.35   ?      3.58 
Median monthly income at the student’s 
  disposal (in Polish zlotys)            184.4  ?      ?      158.3  175.0  ?      ? 
Place of residence during studies 
  (% of students) 
    1                                    51.1   47.8   33.3   30.4   41.2   32.1   ? 
    2                                    35.6   28.3   48.5   39.1   41.2   43.4   ? 
    3                                     4.4    6.5    3.0    4.3    5.9   13.2   ? 
    4                                     8.9   17.4   15.2   26.1   11.8   11.3   ? 
Permanent place of residence 
  (% of students) 
    A                                    62.2   39.1   57.6   43.5   35.3   52.8   ? 
    B                                     6.7   17.4    6.1    4.3    0.0   18.9   ? 
    C                                    20.0   32.6   27.3   34.8   41.2   20.8   ? 
    D                                    11.1   10.9    9.1   17.4   23.5    7.5   ? 

Key: 

1 - student’s hostel 
2 - parents’ flat 
3 - own flat 
4 - rented room 

A - town with more than 100,000 inhabitants 
B - town with 50,000 to 100,000 inhabitants 
C - town with less than 50,000 inhabitants 
D - village 

? - value illegible in the source 



Segment I treats advertising as a source of potential benefits. For its 
members, advertising is a source of information about the product and supports 
their purchase decisions. It is dominated by women (80%), who mainly live in 
student hostels or with their parents and come from cities of more than 100 
thousand inhabitants. This cluster has the highest median monthly income. 

Segment II avoids extreme statements when evaluating (its members prefer 
moderate attitudes). 

Segment III shows an aversion to risk: when buying advertised products, its 
members consider friends’ advice. They admit that advertisements affect 
elderly people more strongly. TV commercials are considered the most powerful 
decision factor. This conclusion may come from radio listening habits. The 
median monthly income is the lowest among all segments. Women represent 66.7% 
of this cluster. 

Segment IV shows strong brand loyalty. Men represent 60.9% of this segment. 
The representatives of this cluster spend relatively less time listening to 
the radio. More than 26% rent a room. 

Segment V is the least numerous. On the one hand, its members declare a 
reluctance towards advertisements; on the other hand, they look for potential 
benefits in advertising. Men represent 64.7% of this cluster. This cluster 
devotes the most time to watching TV. It has the highest percentage of flat 
owners. 

Segment VI does not look for future benefits in advertising and shows a 
reluctance towards advertisements. It devotes relatively less time to watching 
TV and listening to the radio. 

Respondents of Segment VII do not trust advertisements and at the same time 
show strong brand loyalty. Compared with the other segments, the highest 
percentage of them rent a room. 



3. Concluding remarks 

The use of segmentation variables measured on an ordinal scale is relatively 
rare in the literature. Specific analytical tools are needed for such 
information. An important aspect of this study was the formulation, in the 
methodological sense, of a segmentation procedure that uses ordinal variables 
as consumer description criteria. 

In the empirical study seven segments were established and their profiles were 
analysed. Moreover, this segmentation study may also be regarded as a 
contribution to the evaluation of the reactions of students as a specific 
social group of advertisement receivers. 

An additional result of this study is a computer program that computes 
distances between objects (see Appendix). 



Appendix 

The computer code in the C++ language computing the value of distance measure 
(1) is available at Wroclaw University of Economics, Dept. of Econometrics and 
Computer Science (e-mail: abak@keii.ae.jgora.pl). 

The computational algorithm for distance (1) presented below omits the details 
of input and output operations. The complete computer program uses files in 
DBF format, which serve both as the data source and as the result files. The 
Borland® C/C++ V. 4.0 compiler was used for the programming, and the 
CodeBase++™ V. 5.1 library procedures were used for database programming. The 
program generates the distance matrix. This matrix may be used in hierarchical 
agglomerative classification methods to divide the set of objects into classes. 
It can also be used for further computations in the SPSS for Windows package. 

procedure dist_mw(X, Y, m, n) 

{ Computes the distance matrix between objects according to measure (1). 
  X - data matrix (rows - objects, columns - variables measured on the ordinal 
  scale); Y - symmetric distance matrix; m - number of variables; 
  n - number of objects } 
var i, j, k, l, a1, a2, b1, b2: integer; s1, s2, s3, s4, s5, s6, d: real; 

begin 
  for i <- 1 to n do begin 
    for k <- i + 1 to n do begin 
      s1 <- s2 <- s3 <- 0; 
      for j <- 1 to m do begin 
        if X[i,j] > X[k,j] then a1 <- 1 else if X[i,j] < X[k,j] then a1 <- -1 else a1 <- 0; 
        if X[k,j] > X[i,j] then b1 <- 1 else if X[k,j] < X[i,j] then b1 <- -1 else b1 <- 0; 
        s1 <- s1 + a1 * b1; s2 <- s2 + a1 * a1; s3 <- s3 + b1 * b1; 
      end 
      s4 <- s5 <- s6 <- 0; 
      for l <- 1 to n do begin 
        if l <> i and l <> k then begin 
          for j <- 1 to m do begin 
            if X[i,j] > X[l,j] then a2 <- 1 else if X[i,j] < X[l,j] then a2 <- -1 else a2 <- 0; 
            if X[k,j] > X[l,j] then b2 <- 1 else if X[k,j] < X[l,j] then b2 <- -1 else b2 <- 0; 
            s4 <- s4 + a2 * b2; s5 <- s5 + a2 * a2; s6 <- s6 + b2 * b2; 
          end 
        end 
      end 
      d <- 0.5 - (s1 + s4) / (2 * sqrt((s2 + s5) * (s3 + s6))); Y[i,k] <- d; Y[k,i] <- d; 
    end 
  end 
end 
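
For readers who want to experiment with the algorithm, the procedure can be sketched in Python as below. This is a hedged transcription of the pseudocode, not the authors' C++ program. The helper `sign`, the zero-based indexing, and the handling of a zero denominator (returning 0.0, which can only occur when all objects are identical) are assumptions added for illustration.

```python
from math import sqrt


def sign(u, v):
    """Return 1 if u > v, -1 if u < v, and 0 if they are equal."""
    return (u > v) - (u < v)


def dist_mw(X):
    """Distance matrix for ordinal data, following the dist_mw pseudocode.

    X is a list of rows (objects); each row holds m ordinal scores.
    """
    n, m = len(X), len(X[0])
    Y = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            # Direct comparison of objects i and k.
            s1 = s2 = s3 = 0
            for j in range(m):
                a = sign(X[i][j], X[k][j])
                b = sign(X[k][j], X[i][j])
                s1 += a * b
                s2 += a * a
                s3 += b * b
            # Comparison of i and k with every other object l.
            s4 = s5 = s6 = 0
            for l in range(n):
                if l != i and l != k:
                    for j in range(m):
                        a = sign(X[i][j], X[l][j])
                        b = sign(X[k][j], X[l][j])
                        s4 += a * b
                        s5 += a * a
                        s6 += b * b
            denom = 2 * sqrt((s2 + s5) * (s3 + s6))
            # Assumption (not in the source): distance 0.0 if denom is zero.
            d = 0.5 - (s1 + s4) / denom if denom else 0.0
            Y[i][k] = Y[k][i] = d
    return Y


# Hypothetical data: objects 0 and 1 are identical, object 2 scores higher.
Y = dist_mw([[1, 2], [1, 2], [5, 4]])
print(Y[0][1], round(Y[0][2], 4))
```

In this toy example the two identical objects are at distance 0, and two objects that differ in the same direction on every variable, with no other objects present, are at the maximum distance 1.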
1. Introduction 

‘Electronic surveys’ covers several kinds of surveys, including Internet, 
on-line, Web-based and electronic mail (e-mail) surveys, and many such surveys 
are now being conducted in business and research areas such as marketing 
research and consumer behavior. According to their order of historical 
appearance, and depending on how the questionnaires are presented and how the 
respondents’ answers are collected, these survey methods can be classified into 
categories such as e-mail surveys, specific systems designed for limited use, 
and Web-based surveys. In a Web-based survey, the questionnaire sheet is set up 
as a Web page; the respondents access this page and answer the questions. The 
experimental surveys reported in this paper are Web-based surveys. 

Electronic surveys differ greatly from conventional surveys: computers and 
network environments play an important role in designing the survey methods 
and data acquisition systems and in statistically analyzing the results 
obtained. However, technological capability in computing and the Internet has 
outpaced progress in practical, concrete survey methods. As a result, despite 
careful study of such fundamental aspects as representativity, reproducibility 
and reliability, further systematic research on the actual survey methods still 
seems to be needed. In this study, we conducted a series of experimental 
surveys designed to allow conventional and electronic surveys to be compared 
from a methodological viewpoint. The main purpose of this report is to clarify 
the relationship between the conventional approaches and the more recent survey 
methods in such aspects as the design of a sampling survey (including sampling 
methods), the construction of the questionnaire sheet on Web pages, and the 
actual survey procedures. 



2. Outline of Experimental Surveys 

With the cooperation of the staff of Recruit Research Co., we first established 
a panel of 2,036 registered samples. Table 1 shows the panel’s demographics as 
percentages. 



Table 1 Panels' Demographics and Their Component Ratio (%) (N=2,036) 

Sex         Males         (Married   Unmarried)   Females   (Married   Unmarried) 
            [illegible]   (49.1      50.9)        39.7      (38.0      62.0) 

Age Group   Teens   Twenties   Thirties   Forties   Fifties 
            2.4     53.3       37.1       6.0       1.2 

Vocation    Corporate   Civil Service   Self-      Freelancers       Students   Unemployed 
            Workers     and Teachers    employed   and Part-timers              or Others 
            65.5        4.1             3.4        5.4               13.6       8.0 



In Table 1, we see that nearly 80% of the registered samples are office workers 
and students, and also that their age brackets are concentrated in the twenties 
and thirties. Table 1 also indicates a tendency often observed in this kind of 
survey, that is, that there are fewer females in the sample than males. Howev- 
er, in our survey, the ratio of females was higher than in other surveys. 

In all, twelve Web-based surveys were conducted on the panel. In each survey, 
the opinions were collected by the following procedure: (1) The panel members 
were informed by e-mail of the URL of the questionnaire pages and their 
cooperation in the survey was requested. (2) Each member of the panel accessed 
the questionnaire pages and answered the questions shown on the display screen. 
(3) The tabulated results were obtained by gathering the respondents' answers 
and merging them with their demographics. (4) Some incentives were given to 
respondents chosen by drawing lots. It must be noted that, because of technical 
troubles or other reasons, the number of panel members actually e-mailed was 
less than 2,036 in all surveys. Hereafter, we call the e-mailed members of the 
panel the ‘scheduled sample’. 

The themes of the surveys were various, such as ‘Drinking and smoking 
experience,’ ‘Use of Internet’s service sites,’ ‘Food,’ ‘Consumer behavior,’ 
etc. We included our own original questions in the 3rd, 6th, 7th, 10th and 12th 
surveys in order to study the construction of the questionnaires. We describe 
the 6th, the 10th and the 12th surveys in more detail below. 

Our 6th and 12th surveys were designed to clarify the characteristics of our 
new method by asking questions for which established results from conventional 
sampling surveys are available, and by asking the same questions repeatedly. 
The questions asked concerned ‘Attitudes towards life,’ ‘Japanese people's 
nationality,’ ‘Japanese people's images of their top leaders,’ etc., together 
with free-answer questions about the contents, which were asked in almost all 
the surveys. These surveys were conducted from July 13 to July 20, 1997, and 
from November 8 to November 16, 1997, respectively. Reminders were sent once to 
all members of the scheduled sample during the 12th survey. The number of 
scheduled samples was 2,029 for the 6th survey and 2,009 for the 12th survey. 
We received 1,129 complete responses for the 6th survey and 1,157 for the 12th 
survey. 
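
As a small worked example, the completion rates of these two surveys can be recomputed from the counts just given. The sketch assumes that the completion rate is the number of complete responses divided by the size of the scheduled sample; the function name `completion_rate` is illustrative.

```python
def completion_rate(complete_responses, scheduled_sample):
    """Completion rate in percent (assumed definition: complete / scheduled)."""
    return 100.0 * complete_responses / scheduled_sample


rate_6th = completion_rate(1129, 2029)
rate_12th = completion_rate(1157, 2009)
print(f"6th: {rate_6th:.1f}%  12th: {rate_12th:.1f}%")  # 6th: 55.6%  12th: 57.6%
```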

The 10th survey dealt with privacy problems on the Internet. Examples of the 
questions were: ‘Of what matters would you hesitate to be informed about an- 
other person?’, ‘What kind of information do you prefer not to give to 
others?’, and so on. The methods used through all the surveys were more or 
less the same except for the kinds of incentive. The survey period was from 
September 18 to September 24, 1997, and the number of complete responses 
was 954. 



3. Main Findings 

3.1 Some distinguishing features through the total series of surveys 
Fluctuation of completion rate. The average completion rate over all 12 
surveys (54.1%) was higher than we had at first estimated, and there was a 
general downward tendency in the completion rate as the surveys went on. 
Moreover, there was some fluctuation in the completion rates across the surveys. 

In the 12th survey, we asked which surveys ‘you are interested in,’ ‘you feel 
operationally troublesome’ and ‘you think the contents are difficult to answer.’ 
Table 2 shows the percentages choosing each option and the completion rate of 
each survey. Needless to say, the cause of the fluctuation in completion rate 
cannot be easily determined, but the results shown in Table 2 suggest the 
possibility that the theme of each survey, the complexity of the layouts and 
the difficulty of the questions were major causes of the fluctuations, and the 
opinions stated in the open-ended questions also support this. 



Table 2 Completion Rate and Impressions of Each Questionnaire 

Survey                        1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th  11th  12th 
Completion rate (%)           70.5  57.2  57.0  59.0  54.1  55.6  50.9  51.7  48.3  47.4  40.2  57.6 
Interested in the theme (%)   37.9  57.9  35.2  46.1  31.2  34.7  15.4  18.2  28.8  44.6  14.5  - 
Troublesome to answer (%)     18.6  15.8   6.0   6.5   4.2   3.1  12.4  29.1  31.4   5.7  27.9  - 
Difficult to answer (%)        9.8   8.1   5.9  21.2   5.8  14.2  42.2  12.2  12.2  21.5  19.4  - 
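
The average completion rate and the downward tendency can be checked directly from the completion rates in Table 2. The sketch below is only an illustration: the least-squares slope is an assumption about how the "descending tendency" might be quantified, not a method stated by the authors.

```python
# Completion rates (%) of the 1st to 12th surveys, taken from Table 2.
completion = [70.5, 57.2, 57.0, 59.0, 54.1, 55.6,
              50.9, 51.7, 48.3, 47.4, 40.2, 57.6]
surveys = list(range(1, len(completion) + 1))

mean_rate = sum(completion) / len(completion)

# Ordinary least-squares slope of completion rate against survey number.
mean_x = sum(surveys) / len(surveys)
sxy = sum((x - mean_x) * (y - mean_rate) for x, y in zip(surveys, completion))
sxx = sum((x - mean_x) ** 2 for x in surveys)
slope = sxy / sxx

print(f"mean = {mean_rate:.1f}%, slope = {slope:.2f} points per survey")
```

The mean reproduces the 54.1% quoted in the text, and the negative slope is consistent with a gradual decline, interrupted by the 12th survey, during which reminders were sent.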



Table 3 The Distribution of Participation Frequency 

Frequency                  0    1    2    3    4    5    6    7    8    9   10   11   12   Total 
Number of participants   288  138  140  100   89  115  120  144  150  206  215  246   85   2,036 



Participation frequency. Table 3 shows the distribution of access frequency 
to the Web pages. It is interesting that as many as 288 registered members 
(14.1%) never answered any questions. This means that quite a few registered 
only, or only looked at the Web home page, without making any response. On the 
other hand, 85 monitors (about 4.2%) responded to all the surveys. The average 
was about 6.0 times (including non-participating monitors), which means that 
on average they responded to half of the surveys. It should be noted that a 
considerable share of the registered members (about 37%) accessed the Web 
pages 9 or more times. 
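
The figures in this paragraph can be reproduced from the distribution in Table 3. In the sketch below, the count for 12 participations (85) is taken from the text, and the variable names are illustrative.

```python
# Number of panel members by participation frequency (0 to 12), from Table 3.
counts = [288, 138, 140, 100, 89, 115, 120, 144, 150, 206, 215, 246, 85]
total = sum(counts)

mean_participation = sum(f * c for f, c in enumerate(counts)) / total
never = 100.0 * counts[0] / total          # never answered
always = 100.0 * counts[12] / total        # answered all 12 surveys
nine_plus = 100.0 * sum(counts[9:]) / total  # 9 or more participations

print(total, round(mean_participation, 1),
      round(never, 1), round(always, 1), round(nine_plus, 1))
# 2036 6.0 14.1 4.2 36.9
```

The share of about 37% corresponds to members with nine or more participations; counting only those with ten or more gives a smaller share.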

The answer rates to the open-ended questions. In all of the surveys we set 
up free-answer questions as the last item. This section was filled in by 667 
out of 1,129 respondents (59.1%) in the 6th survey and by 489 out of 954 (about 
51.3%) in the 10th survey. To the questions of the 6th survey, “What is the 
most important?” and “What is the second most important?”, 1,119 and 1,061 out 
of 1,129 participants wrote answers (99.1% and 94.0%, respectively). These 
rates seem unusually high. We may well say that those interested in these 
surveys are active enough not to hesitate to fill in such items. 

3.2 Comparison with conventional method and response uncertainty 
Comparing our results with the established results obtained by conventional 
methods, we found that our respondents were in some respects very conservative, 
but on the other hand they seemed very sensitive about their lives. In an 
electronic survey it is important to have other reliable survey results 
available as comparative indices when interpreting the results. 

We used almost the same questions for the 6th and the 12th surveys. In general, 
the results showed a tendency similar to that of conventional surveys, that is, 
some inconsistencies were observed between the marginal response distributions 
of the 6th and the 12th surveys. On the other hand, for some questions the 
cross tabulation of the repeated responses showed interesting distributional 
differences despite few differences between the marginal distributions. For 
example, in the question ‘Consensus versus own principle’ over 20% of the 
respondents reversed their opinion between the two surveys (Table 4). 



Table 4 The Uncertainty of Response 

               12th survey 
6th survey     Principle   Consensus   N.A.    Total 
Principle      29.9        11.5        0.0     41.4 
Consensus      12.2        45.9        0.1     58.2 
N.A.            0.2         0.1        0.0      0.3 
Total          42.3        57.5        0.1    100.0 

Note: The stub is the 6th survey and the banner is the 12th survey. 

‘Consensus versus own principle’ 

‘Which of the two people described on this card would you like best? 
Just read off the letter. 

1: A person who stresses his/her own principles rather than 
achieving a consensus among other group members. 

2: A person who stresses the importance of achieving a consensus among 
other group members rather than maintaining his/her own principles.’ 
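
The share of respondents who reversed their opinion can be read off the off-diagonal cells of Table 4. The following sketch recomputes it; the dictionary layout and variable names are illustrative assumptions.

```python
# Cross tabulation from Table 4 (% of respondents).
# Outer key: answer in the 6th survey; inner key: answer in the 12th survey.
table4 = {
    "Principle": {"Principle": 29.9, "Consensus": 11.5, "N.A.": 0.0},
    "Consensus": {"Principle": 12.2, "Consensus": 45.9, "N.A.": 0.1},
    "N.A.":      {"Principle": 0.2,  "Consensus": 0.1,  "N.A.": 0.0},
}

# Respondents who moved between 'Principle' and 'Consensus'.
switched = table4["Principle"]["Consensus"] + table4["Consensus"]["Principle"]
# Respondents who gave the same substantive answer both times.
stable = table4["Principle"]["Principle"] + table4["Consensus"]["Consensus"]

print(round(switched, 1), round(stable, 1))  # 23.7 75.8
```

The marginal totals differ by less than one percentage point between the two surveys (41.4 versus 42.3 for ‘Principle’), yet almost a quarter of the respondents changed sides, which is what the cross tabulation reveals and the marginals hide.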

3.3 Privacy on the Internet 

In the 10th survey, we asked questions about privacy problems on the Internet 
and the distribution of information on the Web. The main characteristics we 
found are as follows. Many respondents regard their income, address and phone 
number as the core of their privacy. They attach little importance to the 
secrecy of their sex, age, vocation or e-mail address. By far the most common 
reasons for not answering are ‘It's not clear for what my answer will be used.’ 
and ‘It's not worth while answering.’ When the respondents were asked to enter 
their own information, many of them gave no answer. Few intentionally entered 
inaccurate answers. Most respondents were active in answering, but some of the 
options seemed difficult for them to answer. 
4. Discussion 

One conclusion is that numerical values for comparison are needed when 
interpreting the tabulated results of electronic surveys. Moreover, it is 
important to design a panel study with the same respondents so that surveys can 
be repeated with similar or identical questions. This is only natural, yet we 
can find no electronic survey which fully takes account of the interpretability 
and consistency of responses. 

On the other hand, we were also able to obtain other kinds of information. For 
example, there were many highly experienced Web users in our surveys. This 
raises the possibility that the same persons may be participating in different 
surveys. We may conclude that the population of respondents on the Internet is 
limited and that the chance of the same persons being captured frequently is 
very high. Different surveys may in effect be conducted repeatedly on a group 
with a fixed number of members. A considerable number of duplicated answers and 
registrations were also observed. To prevent such duplication, we could develop 
a system that discards repeated answers from the same person, but such a system 
might deny respondents the opportunity to correct mistakes. Although electronic 
surveys are thought to have the great advantage of real-time processing, in 
fact there is a great deal to be done: identification and acknowledgment of the 
respondents' answers, maintenance of data archives, checking of the record 
contents, and immediate processing. Moreover, although all members of the 
registered panel had been informed of our purposes at the beginning of the 
surveys, many respondents wrote questions such as ‘Why do you conduct such a 
survey?’ in the free-answer spaces. We think it is indispensable to ask the 
respondents' opinions concerning the public release of such information, and 
also to give the results back to them as soon as possible. Such actions would 
help raise awareness of issues related to the administration of surveys. 